







































































Que: Which of the following is true about security audit?
Ans: A policy is always needed for performing a security audit.
Que: Which of the following is not a function of security audit?
Ans: Event maintenance.
Que: Which of the following is not a part of the layered architecture of Apache Sentry?
Ans: Policy metadata.
Que: Which of the following is not true about Hive buckets?
Ans: Buckets are stored as subdirectories.
Que: Which of the following supports cross-language services?
Ans: Thrift server.
Que: Which of the following maintains the life cycle of a HiveQL statement?
Ans: Driver.
Que: We can use Hive's SerDe function for:
Ans: All of the above (unstructured data like audio and video, semi-structured data like XML and JSON, structured data).
Que: Which of the following is the correct ordering of the Solr hierarchy?
Ans: Instance, Index, Document, Field.

































































































































































































APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT: CSE
Bachelor of Engineering (Computer Science & Engineering)
Big Data Security
Introduction
To Big Data
DISCOVER. LEARN. EMPOWER




Course Objective
Students will learn:
1 | To understand the concept of Big Data and define security control with core disciplines |
2 | To monitor data usage for modelling real-world problems |
3 | To secure and protect data |
2

Big Data Security
Course Outcome
CO | Title | Level |
1 | Recognize all security related issues in big data systems and | Understand |
2 | Understand cryptographic principles and mechanisms to | Remember |
3 | Identify security risks and challenges for Big Data system. | Apply |

Data: Data is a set of values of qualitative or quantitative variables. It is
information in raw or unorganized form. It may be facts, figures, characters,
symbols, etc.
Information: Organized data that carries meaning is information.
Analytics: Analytics is the discovery, interpretation, and communication of
meaningful patterns or summaries in data.
Data Analytics (DA) is the process of examining data sets in order to draw
conclusions about the information they contain.
Analytics is not a tool or technology; rather, it is a way of thinking and acting on
data.
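To make descriptive analytics concrete, here is a minimal sketch in Python; the daily sales figures are invented for illustration.

```python
# A minimal descriptive-analytics sketch: summarizing a data set is the
# simplest way to draw conclusions from it. The figures are made up.
from statistics import mean, median

daily_sales = [120, 135, 128, 310, 125, 130, 122]  # hypothetical data

print("mean:  ", round(mean(daily_sales), 1))
print("median:", median(daily_sales))
print("max:   ", max(daily_sales))  # the outlier a diagnostic step would investigate
```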

Types of Analytics
Descriptive
Diagnostic
Predictive
Prescriptive

Types of Analytics
(Contd.)

Big Data
Definition - Big data is defined as collections of datasets whose
volume, velocity or variety is so large that it is difficult to store,
manage, process and analyze the data using traditional databases and
data processing tools. In recent years, there has been an
exponential growth in both the structured and unstructured data
generated by information technology, industrial, healthcare, Internet
of Things, and other systems.

Characteristics of Big Data
Volume
Velocity
Variety
Veracity
Value






Volume (Scale)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 ZB
Data volume is increasing exponentially
Exponential increase in collected/generated data











Examples of the data deluge:
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
30 billion RFID tags today (1.3 billion in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009; 200 million expected by 2014
2+ billion people on the Web by end of 2011






Variety (Complexity)
Relational data (tables/transactions/legacy data)
Text data (Web)
Semi-structured data (XML)
Graph data
Social networks, Semantic Web (RDF), …
Streaming data
You can only scan the data once
A single application can be generating/collecting many
types of data
Big public data (online, weather, finance, etc.)
To extract knowledge, all these types of
data need to be linked together
11








A Single View to the Customer
Figure: a single customer view draws together social media, banking and finance, gaming, entertainment, purchase data, and our known history of the customer.


Velocity (Speed)
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missed opportunities
Examples
E-Promotions: based on your current location and your purchase history and
likes, send promotions right now for the store next to you
Healthcare monitoring: sensors monitoring your activities and body; any
abnormal measurement requires immediate reaction (see the sketch below)
13
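A minimal sketch of the healthcare-monitoring example, assuming a numeric sensor stream; the window size and threshold are illustrative choices, not values from the slides.

```python
# Streaming sketch: each new reading is checked against a sliding window
# of recent values so an abnormal measurement triggers an immediate
# reaction. WINDOW and THRESHOLD are assumed, illustrative parameters.
from collections import deque
from statistics import mean, stdev

WINDOW = 30       # recent readings kept
THRESHOLD = 3.0   # flag readings more than 3 standard deviations away

def monitor(readings):
    """Yield (reading, is_abnormal) for a stream of numeric readings."""
    window = deque(maxlen=WINDOW)
    for r in readings:
        abnormal = False
        if len(window) >= 5:  # need a few samples before judging
            m, s = mean(window), stdev(window)
            abnormal = s > 0 and abs(r - m) > THRESHOLD * s
        window.append(r)
        yield r, abnormal

# Example: a heart-rate stream with one spike
for value, alert in monitor([72, 75, 74, 73, 76, 74, 75, 73, 140, 74]):
    if alert:
        print("immediate reaction needed: abnormal reading", value)
```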


14

Assessment Pattern
S.No. | Item | Number/semester | Marks | System |
1 | MSTs | 2 | 36 (12 each) | Combined tests |
2 | Quiz | 2 | 4 | Once online |
3 | Surprise test | 1 | 12 | Teacher decides |
4 | Assignments | 3 (one per unit) | 10 | By teacher as per the dates specified |
5 | Tutorials | Depending on classes | 3 | In tutorial classes |
6 | Attendance | Above 90% | 2 | |
Internal (division as mentioned above points 1-6) | 40 | |||
External | 60 | |||
Total | 100 | |||
15

REFERENCES
Big Data Analytics: A Hands-On Approach by Arshdeep Bahga, Vijay Madisetti
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services
Hadoop Security: Protecting Your Big Data Platform by Ben Spivey, Joey Echeverria, O'Reilly Media (2015)
16

THANK YOU
For queries
APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT: CSE
Bachelor of Engineering (Computer Science & Engineering)
Big Data Security
Introduction
To Big Data
DISCOVER. LEARN. EMPOWER




Course Objective
Students will learn:
1 | To understand the concept of Big Data and define security control with core disciplines |
2 | To monitor data usage for modelling real-world problems |
3 | To secure and protect data |
2

Big Data Security
Course Outcome
CO | Title | Level |
1 | Recognize all security related issues in big data systems and | Understand |
2 | Understand cryptographic principles and mechanisms to | Remember |
3 | Identify security risks and challenges for Big Data system. | Apply |

Data collection is the first step for any analytics application.
Before the data can be analyzed, the data must be collected and
ingested into a big data stack. The choice of tools and
frameworks for data collection depends on the source of data
and the type of data being ingested.
For data collection, various types of connectors can be used
such as publish-subscribe messaging frameworks, messaging
queues, source-sink connectors, database connectors and
custom connectors.
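For example, a minimal publish-subscribe ingestion sketch using the kafka-python client; the broker address, topic name, and payload are assumptions for illustration.

```python
# Publish-subscribe ingestion sketch with kafka-python. The broker
# address and topic name are illustrative assumptions.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"device": "meter-42", "kwh": 1.7}')
producer.flush()  # make sure the record is actually sent

consumer = KafkaConsumer("sensor-readings",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for record in consumer:   # each record is handed to the big data stack
    print(record.value)
    break
```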

Data can often be dirty and can have various issues that must be
resolved before the data can be processed, such as corrupt
records, missing values, duplicates, inconsistent abbreviations,
inconsistent units, typos, incorrect spellings and incorrect
formatting.
The data preparation step involves various tasks such as data
cleansing, data wrangling or munging, de-duplication,
normalization, sampling and filtering.
Data cleaning detects and resolves issues such as corrupt
records, records with missing values, and records with bad formatting.
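A minimal data-preparation sketch with pandas; the toy records and the abbreviation rule are assumed for illustration.

```python
# Data cleansing sketch: resolve inconsistent abbreviations, missing
# values and duplicates. The records below are invented.
import pandas as pd

raw = pd.DataFrame({
    "city":  ["NYC", "New York", "NYC", None],
    "sales": [100, 100, 250, 300],
})

clean = (
    raw
    .replace({"city": {"New York": "NYC"}})  # inconsistent abbreviations
    .dropna(subset=["city"])                 # records with missing values
    .drop_duplicates()                       # de-duplication
)
print(clean)
```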


The next step in the analysis flow is to determine the analysis
type for the application.
In the figure, the various options for analysis types and the popular
algorithms for each analysis type are listed.

With the analysis types selected for an application, the next step is to
determine the analysis mode, which can be batch, real-time or
interactive.
The choice of the mode depends on the requirements of the application.
If your application demands results to be updated after short intervals of time
(say every few seconds), then the real-time analysis mode is chosen.
However, if your application only requires the results to be generated and
updated on larger timescales (say daily or monthly), then batch mode can be
used.
If your application demands the flexibility to query data on demand, then the
interactive mode is useful.
Once you make a choice of the analysis type and the analysis mode, you can
proceed to select suitable tools and frameworks.

The choice of the visualization tools, serving databases and web frameworks
is driven by the requirements of the application.
Visualizations can be static, dynamic or interactive. Static visualizations are
used when you have the analysis results stored in a serving database and you
simply want to display the results.
However, if your application demands the results to be updated regularly, then
you would require dynamic visualizations (with live widgets, plots, or gauges).
If you want your application to accept inputs from the user and display the
results, then you would require interactive visualizations.
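A minimal static-visualization sketch with matplotlib, assuming the aggregated results were already fetched from a serving database; the numbers are placeholders.

```python
# Static visualization: render stored analysis results once, on demand.
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
results = [120, 135, 128, 150, 142]  # hypothetical aggregated results

plt.bar(days, results)
plt.title("Daily sales (from serving database)")
plt.ylabel("Units sold")
plt.savefig("report.png")  # a static chart, regenerated only when asked
```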

Big Data Patterns - Sharding
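The sharding figure is not reproduced here; as a minimal sketch of the pattern, assuming simple hash-based shard assignment across three nodes (the node names are illustrative):

```python
# Hash-based sharding sketch: each record key maps deterministically to
# one shard (node), spreading data across the cluster.
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # assumed shard servers

def shard_for(key: str) -> str:
    """Deterministically map a record key to a shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for user in ["alice", "bob", "carol"]:
    print(user, "->", shard_for(user))
```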

Consistency, Availability & Partition Tolerance
(CAP)


Consistency, Availability & Partition Tolerance
(CAP)
A consistent system is one in which all reads are guaranteed to incorporate
the previous writes.
Availability refers to the ability of the system to respond to all queries.
Partition tolerance refers to the ability of the system to continue performing
its operations in the event of network partitions.
The CAP theorem states that a system can either favor consistency and
partition tolerance over availability, or favor availability and partition
tolerance over consistency.

Assessment Pattern
S.No. | Item | Number/semester | Marks | System |
1 | MSTs | 2 | 36 (12 each) | Combined tests |
2 | Quiz | 2 | 4 | Once online |
3 | Surprise test | 1 | 12 | Teacher decides |
4 | Assignments | 3 (one per unit) | 10 | By teacher as per the dates specified |
5 | Tutorials | Depending on classes | 3 | In tutorial classes |
6 | Attendance | Above 90% | 2 | |
Internal (division as mentioned above points 1-6) | 40 | |||
External | 60 | |||
Total | 100 | |||
13

REFERENCES
Big Data Analytics: A Hands-On Approach by Arshdeep Bahga, Vijay Madisetti
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services
Hadoop Security: Protecting Your Big Data Platform by Ben Spivey, Joey Echeverria, O'Reilly Media (2015)
14

THANK YOU
For queries
APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT: CSE
Bachelor of Engineering (Computer Science & Engineering)
Big Data Security
Hadoop Roles
& Separation
strategies
DISCOVER. LEARN. EMPOWER



Course Objective
Students will learn:
1 | To understand the concept of Big Data and define security control with core disciplines |
2 | To monitor data usage for modelling real-world problems |
3 | To secure and protect data |
2

Big Data Security
Course Outcome
CO | Title | Level |
1 | Recognize all security related issues in big data systems and | Understand |
2 | Understand cryptographic principles and mechanisms to | Remember |
3 | Identify security risks and challenges for Big Data system. | Apply |




Apache HDFS: NameNode, DataNode, JournalNode, HttpFS, NFS Gateway, KMS
Apache YARN: ResourceManager, JobHistory Server, NodeManager
Apache MapReduce: JobTracker (head), TaskTracker (worker)
Cloudera Impala: Impala daemon (impalad), StateStore, Catalog server
Apache Sentry (Incubating): Sentry server, Policy database
Apache Hive: Metastore database, Metastore server, HiveServer2, HCatalog
Apache HBase: Master, RegionServer, REST server, Thrift server
Apache Accumulo: Master, TabletServer, GarbageCollector, Tracer
Cloudera Hue: Hue server, Kerberos Ticket Renewer
Apache Solr
Apache Oozie
Apache ZooKeeper
Apache Flume
Apache Sqoop

Following is a list of roles that should be run on dedicated master nodes:
HDFS NameNode, Secondary NameNode (or Standby NameNode), Failover‐Controller,
JournalNode, and KMS
MapReduce JobTracker and FailoverController
YARN ResourceManager and JobHistory Server
Hive Metastore Server
Impala Catalog Server and StateStore Server
Sentry Server
ZooKeeper Server
HBase Master
Accumulo Master, Tracer, and GarbageCollector

The typical roles found on worker nodes are the following:
HDFS DataNode
MapReduce TaskTracker
YARN NodeManager
Impala Daemon
HBase RegionServer
Accumulo TabletServer
SolrServer

The typical roles found on management nodes are:
Configuration management
Monitoring
Alerting
Software repositories
Backend databases

The following roles are typically found on edge nodes:
HDFS HttpFS and NFS gateway
Hive HiveServer2 and WebHCatServer
Network proxy/load balancer for Impala
Hue server and Kerberos ticket renewer
Oozie server
HBase Thrift server and REST server
Flume agent
Client configuration files

Assessment Pattern
S.No. | Item | Number/semester | Marks | System |
1 | MSTs | 2 | 36 (12 each) | Combined tests |
2 | Quiz | 2 | 4 | Once online |
3 | Surprise test | 1 | 12 | Teacher decides |
4 | Assignments | 3 (one per unit) | 10 | By teacher as per the dates specified |
5 | Tutorials | Depending on classes | 3 | In tutorial classes |
6 | Attendance | Above 90% | 2 | |
Internal (division as mentioned above points 1-6) | 40 | |||
External | 60 | |||
Total | 100 | |||
11

REFERENCES
Big Data Analytics: A Hands-On Approach by Arshdeep Bahga, Vijay Madisetti
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services
Hadoop Security: Protecting Your Big Data Platform by Ben Spivey, Joey Echeverria, O'Reilly Media (2015)
12

THANK YOU
For queries
APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT: CSE
Bachelor of Engineering (Computer Science & Engineering)
Big Data Security
Data Management
DISCOVER. LEARN. EMPOWER




Course Objective
Students will learn:
1 | To understand the concept of Big Data and define security control with core disciplines |
2 | To monitor data usage for modelling real-world problems |
3 | To secure and protect data |
2

Big Data Security
Course Outcome
CO | Title | Level |
1 | Recognize all security related issues in big data systems and | Understand |
2 | Understand cryptographic principles and mechanisms to | Remember |
3 | Identify security risks and challenges for Big Data system. | Apply |

Definition
Data management is the process of ingesting, storing, organizing
and maintaining the data created and collected by an organization.
The data management process includes a combination of different
functions that collectively aim to make sure that the data in
corporate systems is accurate, available and accessible.

Types of data management functions
Data modeling
Data integration
Data governance
Master data management


Data management tools and techniques
Database management systems
RDBMS
NoSQL databases
Document databases
Key-value databases
Column-oriented databases
Graph databases
Big data management
Data warehouses and data lakes
Data integration
ETL (see the sketch below)
ELT
Data governance, data quality and MDM
Data modeling
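A minimal ETL sketch in Python, assuming a local CSV source and a SQLite target; the file name, column names, and transformation are illustrative.

```python
# Extract-Transform-Load sketch: CSV in, SQLite out.
import csv
import sqlite3

# Extract: read raw rows from a CSV file (assumed to have city and
# amount_cents columns)
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalize units (amounts assumed to arrive in cents)
for row in rows:
    row["amount_usd"] = int(row["amount_cents"]) / 100

# Load: write the transformed rows into the warehouse table
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS sales (city TEXT, amount_usd REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [(r["city"], r["amount_usd"]) for r in rows])
db.commit()
```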

Data management tasks and roles

Assessment Pattern
S.No. | Item | Number/semester | Marks | System |
1 | MSTs | 2 | 36 (12 each) | Combined tests |
2 | Quiz | 2 | 4 | Once online |
3 | Surprise test | 1 | 12 | Teacher decides |
4 | Assignments | 3 (one per unit) | 10 | By teacher as per the dates specified |
5 | Tutorials | Depending on classes | 3 | In tutorial classes |
6 | Attendance | Above 90% | 2 | |
Internal (division as mentioned above points 1-6) | 40 | |||
External | 60 | |||
Total | 100 | |||
8

REFERENCES
Big Data Analytics: A Hands-On Approach by Arshdeep Bahga, Vijay Madisetti
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services
Hadoop Security: Protecting Your Big Data Platform by Ben Spivey, Joey Echeverria, O'Reilly Media (2015)
9

THANK YOU
For queries
APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT: CSE
Bachelor of Engineering (Computer Science & Engineering)
Big Data Security
IAM
DISCOVER. LEARN. EMPOWER




Course Objective
Students will learn:
1 | To understand the concept of Big Data and define security control with core disciplines |
2 | To monitor data usage for modelling real-world problems |
3 | To secure and protect data |
2

Big Data Security
Course Outcome
CO | Title | Level |
1 | Recognize all security related issues in big data systems and | Understand |
2 | Understand cryptographic principles and mechanisms to | Remember |
3 | Identify security risks and challenges for Big Data system. | Apply |

Definition
AWS Identity and Access Management (IAM) is a web service that
helps you securely control access to AWS resources.
If we break down the term Identity and Access Management:
Identity — stands for Authentication, and
Access — stands for Authorization.
In AWS, an API call is authenticated by signing the request with an
HMAC signature computed using the secret key.
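A minimal sketch of HMAC request signing in this spirit; it is a simplified illustration, not the actual AWS Signature Version 4 algorithm, and the key and canonical request are placeholders.

```python
# Simplified HMAC request signing (NOT real AWS SigV4): the client signs
# a canonical request string with its secret key; the server recomputes
# the signature and compares.
import hashlib
import hmac

SECRET_KEY = b"example-secret-key"  # placeholder; never hard-code real keys

def sign(request_string: str) -> str:
    """Return a hex HMAC-SHA256 signature over the canonical request."""
    return hmac.new(SECRET_KEY, request_string.encode(), hashlib.sha256).hexdigest()

canonical_request = "GET\n/\nhost:s3.amazonaws.com\n20240101T000000Z"
print(sign(canonical_request))  # sent with the request for verification
```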

Users - Using IAM, we can create and manage AWS users and use permissions to
allow and deny their access to AWS resources.
Groups - The users created can also be divided into groups; the rules and
policies that apply to a group then apply to its users as well.
Roles - An IAM role is an IAM entity that defines a set of permissions for making
AWS service requests. Trusted entities such as IAM users, applications or AWS
services like EC2, Lambda etc. assume these roles to carry out tasks on our
behalf.
Policies - We create policies to assign permissions to a user, group, role or
resource. A policy is a document that explicitly lists the permissions.

1. Service control policies (SCPs)
2. Identity-based policies
3. Resource-based policies

Service control policies (SCPs)

Identity-based policies
A policy that is attached to an identity in IAM is known as
an identity-based policy.
Identity-based policies can include:
AWS managed policies,
customer managed policies, and
inline policies.

IAM policy structure
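The structure figure is not reproduced here; as a sketch, a minimal identity-based policy document and its creation with boto3 (the policy name, action, and bucket ARN are illustrative assumptions):

```python
# An IAM policy document is a JSON object with Version and Statement
# fields; each statement names an Effect, Actions, and Resources.
import json
import boto3

policy_document = {
    "Version": "2012-10-17",          # policy language version
    "Statement": [{
        "Sid": "AllowReadReports",    # optional statement identifier
        "Effect": "Allow",            # Allow or Deny
        "Action": ["s3:GetObject"],   # what may be done
        "Resource": "arn:aws:s3:::example-reports/*",  # on which resource
    }],
}

iam = boto3.client("iam")  # requires valid AWS credentials
iam.create_policy(PolicyName="read-reports",
                  PolicyDocument=json.dumps(policy_document))
```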

Resource-based policies
With resource-based policies, we can specify who
has access to the resource and what actions they
can perform on it.

Policy Evaluation

Assessment Pattern
S.No. | Item | Number/semester | Marks | System |
1 | MSTs | 2 | 36 (12 each) | Combined tests |
2 | Quiz | 2 | 4 | Once online |
3 | Surprise test | 1 | 12 | Teacher decides |
4 | Assignments | 3 (one per unit) | 10 | By teacher as per the dates specified |
5 | Tutorials | Depending on classes | 3 | In tutorial classes |
6 | Attendance | Above 90% | 2 | |
Internal (division as mentioned above points 1-6) | 40 | |||
External | 60 | |||
Total | 100 | |||
12

REFERENCES
Big Data Analytics: A Hands-On Approach by Arshdeep Bahga, Vijay Madisetti
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services
Hadoop Security: Protecting Your Big Data Platform by Ben Spivey, Joey Echeverria, O'Reilly Media (2015)
13

THANK YOU
For queries
APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT: CSE
Bachelor of Engineering (Computer Science & Engineering)
Big Data Security
Network Security
DISCOVER. LEARN. EMPOWER




Course Objective
Students will learn:
1 | To understand the concept of Big Data and define security control with core disciplines |
2 | To monitor data usage for modelling real-world problems |
3 | To secure and protect data |
2

Big Data Security
Course Outcome
CO | Title | Level |
1 | Recognize all security related issues in big data systems and | Understand |
2 | Understand cryptographic principles and mechanisms to | Remember |
3 | Identify security risks and challenges for Big Data system. | Apply |

Network Segmentation
Physical
Logical
Hybrid

Data Movement
Client Access
Administration Traffic



Assessment Pattern
S.No. | Item | Number/semester | Marks | System |
1 | MSTs | 2 | 36 (12 each) | Combined tests |
2 | Quiz | 2 | 4 | Once online |
3 | Surprise test | 1 | 12 | Teacher decides |
4 | Assignments | 3 (one per unit) | 10 | By teacher as per the dates specified |
5 | Tutorials | Depending on classes | 3 | In tutorial classes |
6 | Attendance | Above 90% | 2 | |
Internal (division as mentioned above points 1-6) | 40 | |||
External | 60 | |||
Total | 100 | |||
8

REFERENCES
Big Data Analytics: A Hands-On Approach by Arshdeep Bahga, Vijay Madisetti
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services
Hadoop Security: Protecting Your Big Data Platform by Ben Spivey, Joey Echeverria, O'Reilly Media (2015)
9

THANK YOU
For queries
APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT: CSE
Bachelor of Engineering (Computer Science & Engineering)
Big Data Security
Introduction
To Hadoop
DISCOVER. LEARN. EMPOWER




Course Objective
Students will learn:
1 | To understand the concept of Big Data and define security control with core disciplines |
2 | To monitor data usage for modelling real-world problems |
3 | To secure and protect data |
2

Big Data Security
Course Outcome
CO | Title | Level |
1 | Recognize all security related issues in big data systems and | Understand |
2 | Understand cryptographic principles and mechanisms to | Remember |
3 | Identify security risks and challenges for Big Data system. | Apply |

The hadoop.apache.org web site defines Hadoop as “a framework that allows
for the distributed processing of large data sets across clusters of computers
using simple programming models.”
Hadoop has two main components: the Hadoop Distributed File System (HDFS)
and a framework for processing large amounts of data in parallel using the
MapReduce paradigm.
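As a minimal sketch of the MapReduce paradigm, the classic word count, simulated here in a single Python process; on a real cluster the map calls would run in parallel across HDFS blocks.

```python
# Word count in MapReduce style: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums the counts per word.
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) pairs for each word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)

lines = ["Hadoop stores data in HDFS", "MapReduce processes data in parallel"]

groups = defaultdict(list)          # shuffle: group by key
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

for word, counts in sorted(groups.items()):
    print(reduce_phase(word, counts))
```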





An unauthorized client may access an HDFS file or cluster metadata via
the RPC or HTTP protocols (since the communication is unencrypted and
unsecured by default).
An unauthorized client may read/write a data block of a file at a
DataNode via the pipeline streaming data-transfer protocol (again,
unencrypted communication).
A task or node may masquerade as a Hadoop service component (such
as DataNode) and modify the metadata or perform destructive
activities.
A malicious user with network access could intercept unencrypted
internode communications.
Data on failed disks in a large Hadoop cluster can leak private
information if not handled properly.










Assessment Pattern
S.No. | Item | Number/semester | Marks | System |
1 | MSTs | 2 | 36 (12 each) | Combined tests |
2 | Quiz | 2 | 4 | Once online |
3 | Surprise test | 1 | 12 | Teacher decides |
4 | Assignments | 3 (one per unit) | 10 | By teacher as per the dates specified |
5 | Tutorials | Depending on classes | 3 | In tutorial classes |
6 | Attendance | Above 90% | 2 | |
Internal (division as mentioned above points 1-6) | 40 | |||
External | 60 | |||
Total | 100 | |||
17

REFERENCES
Big Data Analytics: A Hands-On Approach by Arshdeep Bahga, Vijay Madisetti
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services
Hadoop Security: Protecting Your Big Data Platform by Ben Spivey, Joey Echeverria, O'Reilly Media (2015)
18

THANK YOU
For queries



































































































































































































Department of Computer Science & Engineering
BIG DATA SECURITY
(CSC-482)
DR. JASPREET SINGH BATTH
E10279
ASSISTANT PROFESSOR
CSE (AIT), CU
DISCOVER . LEARN . EMPOWER
5/5/2021 Chandigarh University 1


About COURSE
To understand Hadoop Components and HDFS.
To learn about inherent security issues with HDFS.
To learn how Hadoop deals with inherent security issues.
To define Hadoop’s Operational Security Woes.
To learn about Authentication and Authorization.
To learn how to harness Fine grained authorization.
To learn about Hadoop logging for security.
To do case study about Ganglia and Nagios.
To know about Encryption of data at rest and in transit.
To study open source authentication in Hadoop
To learn about PuTTY’s Host Keys
To understand key based authentication using PuTTY
5/5/2021 Chandigarh University 2

COURSE OBJECTIVES
CO Number | Title | Level |
CO1 | Describe how the security for Big Data | Understand & |
CO2 | To evaluate the basics of Big Data Security and its case study for Big Data Applications. (Analytics Flow for Big Data) | Understand & |
CO3 | Hadoop Logging, Encryption of data in | Apply |
3
5/5/2021 Chandigarh University

COURSE OUTCOMES
To understand why Authentication and Authorization are
required.
To learn how to analyze security issues with HDFS.
To learn how to securely administer HDFS.
5/5/2021 Chandigarh University 4

CONTENTS TO BE COVERED
Introduction to Kerberos for Hadoop security.
Preparing for Kerberos Implementation
Implementing Kerberos for Hadoop
Kerberos workflow example
Case Study for MIT Kerberos
5/5/2021 Chandigarh University 5

KERBEROS
In Greek mythology, a many-headed dog that guards
the entrance of Hades
5/5/2021 Chandigarh University 6


KERBEROS
Users wish to access services on servers.
Three threats exist:
A user may pretend to be another user.
A user may alter the network address of a workstation.
A user may eavesdrop on exchanges and use a replay attack.
5/5/2021 Chandigarh University 7

KERBEROS
Provides a centralized authentication server to authenticate users to
servers and servers to users.
Relies on conventional encryption, making no use of public-key
encryption
Two versions: version 4 and 5
Version 4 makes use of DES
5/5/2021 Chandigarh University 8

Kerberos Version 4
Terms:
C = Client
AS = authentication server
V = server
IDc = identifier of user on C
IDv = identifier of V
Pc = password of user on C
ADc = network address of C
Kv = secret encryption key shared by AS and V
TS = timestamp
|| = concatenation
5/5/2021 Chandigarh University 9

A Simple Authentication Dialogue
(1) C → AS: IDc || Pc || IDv
(2) AS → C: Ticket
(3) C → V: IDc || Ticket
Ticket = EKv[IDc || ADc || IDv]
5/5/2021 Chandigarh University 10

Version 4 Authentication Dialogue
Problems:
Lifetime associated with the ticket-granting ticket
If too short → repeatedly asked for password
If too long → greater opportunity to replay
The threat is that an opponent will steal the ticket
and use it before it expires
5/5/2021 Chandigarh University 11

Version 4 Authentication Dialogue
Authentication Service Exchange: to obtain a ticket-granting ticket
(1) C → AS: IDc || IDtgs || TS1
(2) AS → C: EKc[Kc,tgs || IDtgs || TS2 || Lifetime2 || Tickettgs]
Ticket-Granting Service Exchange: to obtain a service-granting ticket
(3) C → TGS: IDv || Tickettgs || Authenticatorc
(4) TGS → C: EKc,tgs[Kc,v || IDv || TS4 || Ticketv]
Client/Server Authentication Exchange: to obtain service
(5) C → V: Ticketv || Authenticatorc
(6) V → C: EKc,v[TS5 + 1]
5/5/2021 Chandigarh University 12
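A minimal sketch of the idea behind this dialogue, using Fernet symmetric encryption (from the cryptography package) as a stand-in for DES; the field layout and single-process setup are illustrative, not an actual Kerberos implementation.

```python
# Ticket issuance sketch: the KDC seals a ticket under the server's key
# and a session-key copy under the client's key; afterwards both sides
# share a fresh session key without ever exchanging it in the clear.
from cryptography.fernet import Fernet

k_c = Fernet.generate_key()    # client's long-term key (from its password)
k_v = Fernet.generate_key()    # server V's key, shared with the KDC
k_c_v = Fernet.generate_key()  # fresh session key chosen by the KDC

# KDC: build a ticket only V can open, plus a sealed copy for the client
ticket_v = Fernet(k_v).encrypt(b"IDc||ADc||IDv||" + k_c_v)
for_client = Fernet(k_c).encrypt(k_c_v)

# Client: recover the session key; the ticket itself stays opaque
session_key = Fernet(k_c).decrypt(for_client)

# Server: open the ticket and extract the same session key
fields = Fernet(k_v).decrypt(ticket_v)
assert fields.endswith(session_key)  # both ends now hold k_c_v
print("shared session key established")
```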

Overview of Kerberos
5/5/2021 Chandigarh University 13


Request for Service in Another Realm
5/5/2021 Chandigarh University 14


Difference Between Version 4 and 5
Encryption system dependence (V.4 DES)
Internet protocol dependence
Message byte ordering
Ticket lifetime
Authentication forwarding
Interrealm authentication
5/5/2021 Chandigarh University 15

Kerberos Encryption Techniques
5/5/2021 Chandigarh University 16


PCBC Mode
5/5/2021 Chandigarh University 17


Kerberos - in practice
Currently have two Kerberos versions:
4 : restricted to a single realm
5 : allows inter-realm authentication, in beta test
Kerberos v5 is an Internet standard
specified in RFC1510, and used by many utilities
To use Kerberos:
need to have a KDC on your network
need to have Kerberised applications running on all participating systems
major problem - US export restrictions
Kerberos cannot be directly distributed outside the US in source format (&
binary versions must obscure crypto routine entry points and have no
encryption)
else crypto libraries must be reimplemented locally
5/5/2021 Chandigarh University 18

X.509 Authentication Service
Distributed set of servers that maintains a database about users.
Each certificate contains the public key of a user and is signed with
the private key of a CA.
Is used in S/MIME, IP Security, SSL/TLS and SET.
The use of RSA is recommended.
5/5/2021 Chandigarh University 19

X.509 Formats
5/5/2021 Chandigarh University 20


Typical Digital Signature
Approach
5/5/2021 Chandigarh University 21


Obtaining a User’s Certificate
Characteristics of certificates generated by CA:
Any user with access to the public key of the CA can recover the user public
key that was certified.
No party other than the CA can modify the certificate without this being
detected.
5/5/2021 Chandigarh University 22

X.509 CA Hierarchy
5/5/2021 Chandigarh University 23


Revocation of Certificates
Reasons for revocation:
The user's secret key is assumed to be compromised.
The user is no longer certified by this CA.
The CA’s certificate is assumed to be compromised.
5/5/2021 Chandigarh University 24

Authentication Procedures
5/5/2021 Chandigarh University 25


KEY POINTS
Introduction to Kerberos for Hadoop security.
Preparing for Kerberos Implementation
Implementing Kerberos for Hadoop
Kerberos workflow example
Case Study for MIT Kerberos
5/5/2021 Chandigarh University 26


LEARNING MATERIAL
5/5/2021 Chandigarh University 27


ASSESSMENT PATTERN
5/5/2021 Chandigarh University 28


Please Send Your Queries on:
e-Mail: jaspreet.e10279@cumail.in
5/5/2021 Chandigarh University 29

Department of Computer Science & Engineering
BIG DATA SECURITY
(CSC-482)
DR. JASPREET SINGH BATTH
E10279
ASSISTANT PROFESSOR
CSE (AIT), CU
DISCOVER . LEARN . EMPOWER
5/5/2021 Chandigarh University 1


About COURSE
To understand Hadoop Components and HDFS.
To learn about inherent security issues with HDFS.
To learn how Hadoop deals with inherent security issues.
To define Hadoop’s Operational Security Woes.
To learn about Authentication and Authorization.
To learn how to harness Fine grained authorization.
To learn about Hadoop logging for security.
To do case study about Ganglia and Nagios.
To know about Encryption of data at rest and in transit.
To study open source authentication in Hadoop
To learn about PuTTY’s Host Keys
To understand key based authentication using PuTTY
5/5/2021 Chandigarh University 2

COURSE OBJECTIVES
CO Number | Title | Level |
CO1 | Describe how the security for Big Data | Understand & |
CO2 | To evaluate the basics of Big Data Security and its case study for Big Data Applications. (Analytics Flow for Big Data) | Understand & |
CO3 | Hadoop Logging, Encryption of data in | Apply |
3
5/5/2021 Chandigarh University

COURSE OUTCOMES
To understand why Authentication and Authorization are
required.
To learn how to analyze security issues with HDFS.
To learn how to securely administer HDFS.
5/5/2021 Chandigarh University 4

CONTENTS TO BE COVERED
Introduction to Kerberos for Hadoop security.
Preparing for Kerberos Implementation
Implementing Kerberos for Hadoop
Kerberos workflow example
Case Study for MIT Kerberos
5/5/2021 Chandigarh University 5

MIT Kerberos History
Designed as part of MIT's Project Athena in the 1980s
Kerberos v4 published in 1987
Migration to the IETF
RFC 1510 (Kerberos v5, 1993)
Used in a number of products
Example: part of Windows 2000
MS Passport is essentially Kerberos done w/ client-side cookies over HTTP
5/5/2021 Chandigarh University 6

MIT Kerberos
Designed for single “administration domain” of machines & users: users,
client machines, server machines, and the Key Distribution Center (KDC)
No public key crypto
Provides authentication & encryption services
“Kerberized” servers provide authorization on top of the authenticated
identities
5/5/2021 Chandigarh University 7

The Kerberos Model
Clients
Servers
The Key Distribution Center (KDC)
Centralized trust model
KDC is trusted by all clients & servers
KDC shares a secret, symmetric key with each client and server
A "realm" is a single trust domain consisting of one or more clients, servers,
KDCs
5/5/2021 Chandigarh University 8

Picture of a Kerberos Realm
Figure: a Kerberos realm contains a Client, the Key Distribution Center (KDC), the Ticket Granting Server (TGS), and a Server.
5/5/2021 Chandigarh University 9





Joining a Kerberos Realm
One-time setup
Each client, server that wishes to participate in the realm exchanges a secret key
with the KDC
If the KDC is compromised, the entire system is cracked
Because the KDC knows everyone’s individual secret key, the KDC can
issue credentials to each realm identity
5/5/2021 Chandigarh University 10

Kerberos Credentials
Two types of credentials in Kerberos
Tickets
Authenticators
Tickets are credentials issued to a client for communication with a specific
server
Authenticators are additional credentials that prove a client knows a key
at a point in time
Basic idea: encrypt a “nonce”
5/5/2021 Chandigarh University 11

The Basic Kerberos Protocol
Assume client C wishes to authenticate to and communicate with server S
Phase 1: C gets a Ticket-Granting Ticket (TGT) from the KDC
Phase 2: C uses the TGT to get a Ticket for S
Phase 3: C communicates with S
5/5/2021 Chandigarh University 12

Protocol Definitions
Following Schneier (Section 24.5):
C = client, S = server
TGS = ticket-granting service
Kx = x’s secret key
Kx,y = session key for x and y
{m}Kx = m encrypted in x’s secret key
Tx,y = x’s ticket to use y
Ax,y = authenticator from x to y
Nx = a nonce generated by x
5/5/2021 Chandigarh University 13

The Basic Kerberos Protocol (1)
Phase 1: C gets a Ticket-Granting Ticket
C sends a request to the KDC for a “ticket-granting ticket” (TGT)
A TGT is a ticket used to talk to the special ticket-granting service
A TGT is relatively long-lived (~8-24 hours typically)
C → KDC: C, TGS, NC
Sent in the clear!
5/5/2021 Chandigarh University 14

The Basic Kerberos Protocol (2)
Phase 1: C gets a Ticket-Granting Ticket
KDC responds with two items
The ticket-granting ticket
A ticket for C to talk to TGS
A copy of the session key to use to talk to TGS, encrypted in C’s shared key
KDC → C: {TC,TGS}KTGS , {KC,TGS}KC
where Tc,s = s, {c, c-addr, lifetime, Kc,s}Ks
Only the TGS can decrypt the ticket
C can unlock the second part to retrieve KC,TGS
5/5/2021 Chandigarh University 15

Picture of a Kerberos Realm
Figure: the Client and the Key Distribution Center (KDC) exchange:
C → KDC: C, TGS, NC
KDC → C: {TC,TGS}KTGS , {KC,TGS}KC
where Tc,s = s, {c, c-addr, lifetime, Kc,s}Ks
5/5/2021 Chandigarh University 16



The Basic Kerberos Protocol (3)
Phase 2: C gets a Ticket for S
C requests a ticket to communicate with S from the
ticket-granting service (TGS)
C sends the TGT to the TGS along with an authenticator,
requesting a ticket from C to S
C → TGS: {AC,S}KC,TGS , {TC,TGS}KTGS
where Ac,s = {c, timestamp, opt. subkey}Kc,s
First part proves to TGS that C knows the session key
Second part is the TGT C got from the KDC
5/5/2021 Chandigarh University 17

The Basic Kerberos Protocol (4)
Phase 2: C gets a Ticket for S
TGS returns a ticket for C to talk to S
(Just like step 2 above...)
TGS → C: {TC,S}KS , {KC,S}KC,TGS
Only S can decrypt the ticket
C can unlock the second part to retrieve KC,S
5/5/2021 Chandigarh University 18

Picture of a Kerberos Realm
Figure: the Client and the Ticket Granting Server (TGS) exchange:
C → TGS: {AC,S}KC,TGS , {TC,TGS}KTGS
where Ac,s = {c, timestamp, opt. subkey}Kc,s
TGS → C: {TC,S}KS , {KC,S}KC,TGS
5/5/2021 Chandigarh University 19



The Basic Kerberos Protocol (5)
Phase 3: C communicates with S
C sends the ticket to S along with an authenticator to establish a
shared secret
C → S: {AC,S}KC,S , {TC,S}KS
where Ac,s = {c, timestamp, opt. subkey}Kc,s
S decrypts the ticket TC,S to get the shared secret KC,S needed to
communicate securely with C
5/5/2021 Chandigarh University 20

The Basic Kerberos Protocol (6)
Phase 3: C communicates with S
S decrypts the ticket to obtain the KC,S and replies to C with proof of
possession of the shared secret (optional step)
S → C: {timestamp, opt. subkey}Kc,s
Notice that S had to decrypt the authenticator, extract the timestamp
& opt. subkey, and re-encrypt those two components with Kc,s
5/5/2021 Chandigarh University 21

Picture of a Kerberos Realm
Figure: the Client and the Server exchange:
C → S: {AC,S}KC,S , {TC,S}KS
where Ac,s = {c, timestamp, opt. subkey}Kc,s
S → C: {timestamp, opt. subkey}Kc,s
5/5/2021 Chandigarh University 22



Picture of a Kerberos Realm
Figure: the full flow. The Client sends a TGT request to the Key Distribution Center (KDC) and receives a TGT; it sends a ticket request with the TGT to the Ticket Granting Server (TGS) and receives a ticket; it then presents the ticket with a service request to the Server ("do some stuff").
5/5/2021 Chandigarh University 23





Thoughts on Kerberos...
There’s no public key crypto anywhere in the base Kerberos spec, but you
can modify the base protocols to use PK...
Example: the initial “login” to the KDC could be done with public key for added
security (e.g. PKINIT protocol)
5/5/2021 Chandigarh University 24

KEY POINTS
Introduction to Kerberos for Hadoop security.
Preparing for Kerberos Implementation
Implementing Kerberos for Hadoop
Kerberos workflow example
Case Study for MIT Kerberos
5/5/2021 Chandigarh University 25


LEARNING MATERIAL
5/5/2021 Chandigarh University 26


ASSESSMENT PATTERN
5/5/2021 Chandigarh University 27


Please Send Your Queries on:
e-Mail: jaspreet.e10279@cumail.in
5/5/2021 Chandigarh University 28

Department of Computer Science & Engineering
BIG DATA SECURITY
(CSC-482)
DR. JASPREET SINGH BATTH
E10279
ASSISTANT PROFESSOR
CSE (AIT), CU
DISCOVER . LEARN . EMPOWER
5/5/2021 Chandigarh University 1


About COURSE
To understand Hadoop Components and HDFS.
To learn about inherent security issues with HDFS.
To learn how Hadoop deals with inherent security issues.
To define Hadoop’s Operational Security Woes.
To learn about Authentication and Authorization.
To learn how to harness Fine grained authorization.
To learn about Hadoop logging for security.
To do case study about Ganglia and Nagios.
To know about Encryption of data at rest and in transit.
To study open source authentication in Hadoop
To learn about PuTTY’s Host Keys
To understand key based authentication using PuTTY
5/5/2021 Chandigarh University 2

COURSE OBJECTIVES
CO Number | Title | Level |
CO1 | Describe how the security for Big Data | Understand & |
CO2 | To evaluate the basics of Big Data Security and its case study for Big Data Applications. (Analytics Flow for Big Data) | Understand & |
CO3 | Hadoop Logging, Encryption of data in | Apply |
3
5/5/2021 Chandigarh University

COURSE OUTCOMES
To understand why Authentication and Authorization are
required.
To learn how to analyze security issues with HDFS.
To learn how to securely administer HDFS.
5/5/2021 Chandigarh University 4

CONTENTS TO BE COVERED
Introduction to Kerberos for Hadoop security.
Preparing for Kerberos Implementation
Implementing Kerberos for Hadoop
Kerberos workflow example
Case Study for MIT Kerberos
5/5/2021 Chandigarh University 5

PKINIT in Windows 2K/2K3
Figure: smart-card logon. The client (with a smart card reader and a certificate) sends a logon request to the Key Distribution Center (KDC); the KDC performs verification and an NT user account lookup against Active Directory, then returns a Kerberos Ticket Granting Ticket (TGT) issued using public key cryptography.
5/5/2021 Chandigarh University 6





Thoughts on Kerberos...(2)
Only the KDC needs to know the user’s password (used to generate the
shared secret)
You can have multiple KDCs for redundancy, but they all need to have a copy of the
username/password database
Only the TGS needs to know the secret keys for the servers
You can split KDC from TGS, but it is common for those two services to reside on
the same physical machine
5/5/2021 Chandigarh University 7

Thoughts on Kerberos...(3)
Cross-realm trust is possible
Just need to share a secret key between the KDCs for the two realms...
Once accomplished, a user in realm A can get a ticket for a service in realm B
5/5/2021 Chandigarh University 8

Thoughts on Kerberos...(4)
“Time” is very important in Kerberos
All participants in the realm need accurate clocks
Timestamps are used in authenticators to detect replay; if a host can be fooled
about the current time, old authenticators could be replayed
Tickets tend to have lifetimes on the order of hours, and replays are possible
during the lifetime of the ticket
5/5/2021 Chandigarh University 9

Thoughts on Kerberos...(5)
Password-guessing attacks are possible
Capture enough encrypted tickets and you can brute-force decrypt them to
discover shared keys
(Another reason to use public key...)
5/5/2021 Chandigarh University 10

Thoughts on Kerberos...(6)
It’s possible to screw up the implementation
In fact, Kerberos v4 had a colossal security breach due to bad implementations
5/5/2021 Chandigarh University 11

RNGs in Kerberos v4
Session keys were generated from a PRNG seeded with the XOR of the
following:
Time-of-day in seconds since 1/1/1970
Process ID of the Kerberos server process
Cumulative count of session keys generated
Fractional part of time-of-day seconds
Hostid of the machine running the server
5/5/2021 Chandigarh University 12

RNGs in Kerberos v4 (continued)
The seed is a 32-bit value, so while the session key is used for DES (64 bits
long, normally 56 bits of entropy), it has only 32 bits of entropy
What’s worse, the five values have predictable portions
Time is completely predictable
ProcessID is mostly predictable
Even hostID has 12 predictable bits (of 32 total)
5/5/2021 Chandigarh University 13

RNGs in Kerberos v4 (continued)
Of the 32 seed bits, only 20 bits really change with any frequency, so
Kerberos v4 keys (in the MIT implementation) only have 20 bits of
randomness
They could be brute-force discovered in seconds
The hole was in the MIT Kerberos sources for seven years!
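A minimal sketch of why roughly 20 bits of seed entropy is fatal: the whole seed space can be enumerated in seconds. Python's random module stands in for the v4 key generator, and the victim's seed is arbitrary.

```python
# Brute-forcing a 20-bit seed space: derive a candidate key from every
# possible seed and compare against the captured key material.
import random
import time

def key_from_seed(seed: int) -> bytes:
    """Derive a 'session key' deterministically from a seed."""
    return random.Random(seed).getrandbits(64).to_bytes(8, "big")

target = key_from_seed(0xBEEF)  # the victim's unknown seed

start = time.time()
for seed in range(2**20):       # enumerate the whole seed space
    if key_from_seed(seed) == target:
        print(f"recovered seed {seed:#x} in {time.time() - start:.1f}s")
        break
```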
5/5/2021 Chandigarh University 14

Securing Internet Traffic
Application-level security
Secure the traffic between two communicating applications
Application-specific protocols
Example: SSL/TLS for web traffic
IP-level security
Secure traffic at the Internet Protocol layer (low-level wire format)
Applications don’t have to know about security specifically, they “get it for free”
Example: IPSEC
5/5/2021 Chandigarh University 15

Common Themes
Three phases
Authentication
Verify the other party is someone you want to talk to
Key agreement
Agree on data encryption and integrity protection keys
Encrypted data exchange
Communicate over the encrypted channel
5/5/2021 Chandigarh University 16

SSL/TLS
5/5/2021 Chandigarh University 17

App-Level Security: SSL/TLS
5/5/2021 Chandigarh University 18


SSL/PCT/TLS History
1994: Secure Sockets Layer (SSL) V2.0
1995: Private Communication Technology (PCT) V1.0
1996: Secure Sockets Layer (SSL) V3.0
1997: Private Communication Technology (PCT) V4.0
1999: Transport Layer Security (TLS) V1.0
2005/2006: TLS V1.1 (currently in the RFC Editor’s Queue
awaiting publication)
5/5/2021 Chandigarh University 19

Typical Scenario
You (client) Merchant (server)
Let’s talk securely.
Here is my RSA public key.
Here is a symmetric key, encrypted with your
public key, that we can use to talk.
5/5/2021 Chandigarh University 20

SSL/TLS
You (client) Merchant (server)
Let’s talk securely.
Here is my RSA public key.
Here is a symmetric key, encrypted with your
public key, that we can use to talk.
5/5/2021 Chandigarh University 21

SSL/TLS
You (client) Merchant (server)
Let’s talk securely.
Here are the protocols and ciphers I understand.
Here is my RSA public key.
Here is a symmetric key, encrypted with your
public key, that we can use to talk.
5/5/2021 Chandigarh University 22

SSL/TLS
You (client) Merchant (server)
Let’s talk securely.
Here are the protocols and ciphers I understand.
I choose this protocol and ciphers.
Here is my public key and
some other stuff.
Here is a symmetric key, encrypted with your
public key, that we can use to talk.
5/5/2021 Chandigarh University 23

SSL/TLS
You (client) Merchant (server)
Let’s talk securely.
Here are the protocols and ciphers I understand.
I choose this protocol and ciphers.
Here is my public key and
some other stuff.
Using your public key, I’ve encrypted a
random symmetric key to you.
5/5/2021 Chandigarh University 24

SSL/TLS
All subsequent secure messages are
sent using the symmetric key and a
keyed hash for message authentication.
5/5/2021 Chandigarh University 25

The five phases of SSL/TLS
Negotiate the ciphersuite to be used
Establish the shared session key
Client authenticates the server
(“server auth”)
Optional, but almost always done
Server authenticates the client
(“client auth”)
Optional, and almost never done
Authenticate previously exchanged data
5/5/2021 Chandigarh University 26

Phase 1: Ciphersuite Negotiation
Client hello (client → server)
“Hi! I speak these n ciphersuites, and here’s a 28-byte random number (nonce) I
just picked”
Server hello (server → client)
“Hello. We’re going to use this particular ciphersuite, and here’s a 28-byte nonce I
just picked.”
Other info can be passed along (we’ll see why a little later...)
5/5/2021 Chandigarh University 27

TLS V1.0 ciphersuites
TLS_NULL_WITH_NULL_NULL
TLS_RSA_WITH_NULL_MD5
TLS_RSA_WITH_NULL_SHA
TLS_RSA_EXPORT_WITH_RC4_40_MD5
TLS_RSA_WITH_RC4_128_MD5
TLS_RSA_WITH_RC4_128_SHA
TLS_RSA_EXPORT_WITH_RC2_CBC_40_MD5
TLS_RSA_WITH_IDEA_CBC_SHA
TLS_RSA_EXPORT_WITH_DES40_CBC_SHA
TLS_RSA_WITH_DES_CBC_SHA
TLS_RSA_WITH_3DES_EDE_CBC_SHA
TLS_DH_DSS_EXPORT_WITH_DES40_CBC_SHA
TLS_DH_DSS_WITH_DES_CBC_SHA
TLS_DH_DSS_WITH_3DES_EDE_CBC_SHA
TLS_DH_RSA_EXPORT_WITH_DES40_CBC_SHA
TLS_DH_RSA_WITH_DES_CBC_SHA
TLS_DH_RSA_WITH_3DES_EDE_CBC_SHA
TLS_DHE_DSS_EXPORT_WITH_DES40_CBC_SHA
TLS_DHE_DSS_WITH_DES_CBC_SHA
TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA
TLS_DHE_RSA_EXPORT_WITH_DES40_CBC_SHA
TLS_DHE_RSA_WITH_DES_CBC_SHA
TLS_DHE_RSA_WITH_3DES_EDE_CBC_SHA
TLS_DH_anon_EXPORT_WITH_RC4_40_MD5
TLS_DH_anon_WITH_RC4_128_MD5
TLS_DH_anon_EXPORT_WITH_DES40_CBC_SHA
TLS_DH_anon_WITH_DES_CBC_SHA
TLS_DH_anon_WITH_3DES_EDE_CBC_SHA
5/5/2021 Chandigarh University 28

TLS-With-AES ciphersuites
(RFC 3268)
TLS_RSA_WITH_AES_128_CBC_SHA RSA
TLS_DH_DSS_WITH_AES_128_CBC_SHA DH_DSS
TLS_DH_RSA_WITH_AES_128_CBC_SHA DH_RSA
TLS_DHE_DSS_WITH_AES_128_CBC_SHA DHE_DSS
TLS_DHE_RSA_WITH_AES_128_CBC_SHA DHE_RSA
TLS_DH_anon_WITH_AES_128_CBC_SHA DH_anon
TLS_RSA_WITH_AES_256_CBC_SHA RSA
TLS_DH_DSS_WITH_AES_256_CBC_SHA DH_DSS
TLS_DH_RSA_WITH_AES_256_CBC_SHA DH_RSA
TLS_DHE_DSS_WITH_AES_256_CBC_SHA DHE_DSS
TLS_DHE_RSA_WITH_AES_256_CBC_SHA DHE_RSA
TLS_DH_anon_WITH_AES_256_CBC_SHA DH_anon
5/5/2021 Chandigarh University 29

Phase 2: Establish the shared session key
Client key exchange
Client chooses a 48-byte “pre-master secret”
Client encrypts the pre-master secret with the server’s RSA
public key
Client → server: encrypted pre-master secret
Client and server both compute
PRF (pre-master secret, “master secret”, client nonce + server
nonce)
PRF is a pseudo-random function
First 48 bytes output from PRF form master secret
5/5/2021 Chandigarh University 30

TLS’s PRF
PRF(secret, label, seed) =
P_MD5(S1, label + seed) XOR
P_SHA-1(S2, label + seed);
where S1, S2 are the two halves of the secret
P_hash(secret, seed) =
HMAC_hash(secret, A(1) + seed) + HMAC_hash(secret, A(2) + seed) +
HMAC_hash(secret, A(3) + seed) + ...
A(0) = seed
A(i) = HMAC_hash(secret, A(i-1))
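A direct Python rendering of the PRF defined above, using the standard hmac module; the sample secret, label, and seed are placeholders (S1 and S2 overlap by one byte when the secret length is odd, as the spec prescribes).

```python
# TLS 1.0 PRF: P_MD5 over the first half of the secret XORed with
# P_SHA-1 over the second half, each expanded by HMAC iteration.
import hmac

def p_hash(hash_name, secret, seed, length):
    """P_hash(secret, seed): HMAC-based expansion to 'length' bytes."""
    out, a = b"", seed                                   # A(0) = seed
    while len(out) < length:
        a = hmac.new(secret, a, hash_name).digest()      # A(i)
        out += hmac.new(secret, a + seed, hash_name).digest()
    return out[:length]

def prf(secret, label, seed, length):
    """PRF(secret, label, seed) = P_MD5(S1, ...) XOR P_SHA1(S2, ...)."""
    half = (len(secret) + 1) // 2
    s1, s2 = secret[:half], secret[-half:]
    md5 = p_hash("md5", s1, label + seed, length)
    sha = p_hash("sha1", s2, label + seed, length)
    return bytes(x ^ y for x, y in zip(md5, sha))

# Master secret: first 48 bytes of PRF(pre-master, "master secret", nonces)
master = prf(b"\x00" * 48, b"master secret", b"client+server nonces", 48)
print(master.hex())
```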
5/5/2021 Chandigarh University 31

Phases 3 & 4: Authentication
More on this in a moment...
5/5/2021 Chandigarh University 32

Phase 5: Authenticate previously
exchanged data
“Change ciphersuites” message
Time to start sending data for real...
“Finished” handshake message
First protected message, verifies algorithm parameters for the encrypted channel
12 bytes from:
PRF(master_secret, “client finished”, MD5(handshake_messages) +
SHA-1(handshake_messages))
5/5/2021 Chandigarh University 33

Why do I trust the server key?
How do I know I’m really talking to Amazon.com?
What defeats a man-in-the-middle attack?
Figure: a Client speaking HTTP with SSL/TLS to a Web Server.
5/5/2021 Chandigarh University 34



Why do I trust the server key?
How do I know I’m really talking to Amazon.com?
What defeats a man-in-the-middle attack?
Figure: Mallet in the middle, speaking HTTP with SSL/TLS to the Client on one side and to the Web Server on the other.
5/5/2021 Chandigarh University 35




SSL/TLS
You (client) Merchant (server)
Let’s talk securely.
Here are the protocols and ciphers I understand.
I choose this protocol and ciphers.
Here is my public key and
some other stuff that will make you
trust this key is mine.
Here is a fresh key encrypted with your key.
5/5/2021 Chandigarh University 36

What's the "some other stuff"?
How can we convince Alice that some key belongs to Bob?
Alice and Bob could have met previously & exchanged keys directly.
Jeff Bezos isn’t going to shake hands with everyone he’d like to sell to...
Someone Alice trusts could vouch to her for Bob and Bob’s key
A third party can certify Bob’s key in a way that convinces Alice.
5/5/2021 Chandigarh University 37

What is a certificate?
A certificate is a digitally-signed statement that binds a public key to some
identifying information.
The signer of the certificate is called its issuer.
The entity talked about in the certificate is the subject of the certificate.
That’s all a certificate is, at the 30,000’ level.
5/5/2021 Chandigarh University 38

Defeating Mallet
Bob can convince Alice that his key really does belong to
him if he can also send along a digital certificate Alice
will believe & trust
Alice: Let's talk securely.
Alice: Here are the protocols and ciphers I understand.
Bob: I choose this protocol and ciphers.
Here is my public key and
a certificate to convince you that the
key really belongs to me.
5/5/2021 Chandigarh University 39



Server & Client Authentication
with Certificates
We’re going to talk a lot more about how you determine whether you
trust a name-key binding later in the course
Lecture #8: Trust, Public Key Infrastructure (PKI) and Key Management
For now, simply assume that each client and server can:
Cryptographically validate a certificate to verify its integrity
Decide whether a validated certificate should be believed according to its trust policy
5/5/2021 Chandigarh University 40

KEY POINTS
Introduction to Kerberos for Hadoop security.
Preparing for Kerberos Implementation
Implementing Kerberos for Hadoop
Kerberos workflow example
Case Study for MIT Kerberos
5/5/2021 Chandigarh University 41


LEARNING MATERIAL
5/5/2021 Chandigarh University 42


ASSESSMENT PATTERN
5/5/2021 Chandigarh University 43


Please Send Your Queries on:
e-Mail: jaspreet.e10279@cumail.in
5/5/2021 Chandigarh University 44

Department of Computer Science & Engineering
BIG DATA SECURITY
(CSC-482)
DR. JASPREET SINGH BATTH
E10279
ASSISTANT PROFESSOR
CSE (AIT), CU
DISCOVER . LEARN . EMPOWER
5/5/2021 Chandigarh University 1


About COURSE
To understand Hadoop Components and HDFS.
To learn about inherent security issues with HDFS.
To learn how Hadoop deals with inherent security issues.
To define Hadoop’s Operational Security Woes.
To learn about Authentication and Authorization.
To learn how to harness Fine grained authorization.
To learn about Hadoop logging for security.
To do case study about Ganglia and Nagios.
To know about Encryption of data at rest and in transit.
To study open source authentication in Hadoop
To learn about PuTTY’s Host Keys
To understand key based authentication using PuTTY
5/5/2021 Chandigarh University 2

COURSE OBJECTIVES
CO Number | Title | Level |
CO1 | Describe how the security for Big Data | Understand & |
CO2 | To evaluate the basics of Big Data Security and its case study for Big Data Applications. (Analytics Flow for Big Data) | Understand & |
CO3 | Hadoop Logging, Encryption of data in | Apply |
3
5/5/2021 Chandigarh University

COURSE OUTCOMES
To understand why Authentication and Authorization are
required.
To learn how to analyze security issues with HDFS.
To learn how to securely administer HDFS.
5/5/2021 Chandigarh University 4

CONTENTS TO BE COVERED
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration
5/5/2021 Chandigarh University 5

Who am I
Software engineer at Cloudera
Committer and PPMC member of Apache Sentry
also for Apache Hive and Apache Flume
Part of the original team that started the Sentry work

Aspects of security
Perimeter: Authentication (Kerberos, LDAP/AD)
Access: Authorization (what a user can do with data)
Visibility: Audit, Lineage (data origin, usage)
Data: Encryption, Masking

Data access
Access means authorization: what a user can do with data
Provide user access to data
Manage access policies
Provide role-based access

Agenda
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration
Future plans
Demo
Questions

Apache Sentry (Incubating)
Unified authorization module for Hadoop
Unlocks key RBAC requirements
Secure, fine-grained, role-based authorization
Multi-tenant administration
Enforces a common set of policies across multiple
data access paths in Hadoop.

Key Capabilities of Sentry
Fine-Grained Authorization
Permissions on object hierarchies, e.g., Database, Table,
Columns
Role-Based Authorization
Support for role templates to manage authorization
for a large set of users and data objects
Multi-Tenant Administration
Ability to delegate admin responsibilities for a subset of
resources


Project history and status
Started at Cloudera
Entered incubation in 2013
Growing community
Committers from Cloudera, IBM, Intel, Oracle, …
Three releases from incubation
Widely adopted by industry
Part of multiple commercial Hadoop distros


Agenda
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration
Future plans
Demo
Questions

Key Concepts in Sentry
Global concepts
User, Group, Role, Privilege
Authorization Models
SQL
Server, Database, Table, URI
Search Model
Collection
©2014 Cloudera, Inc. All rights reserved.

Global Concept: User
Individual person
Runs SQL, SOLR queries
Identified by authentication provider
Kerberos, LDAP etc
Just a string for Sentry
Not enforcing existence
Sentry is NOT an authentication system
©2014 Cloudera, Inc. All rights reserved.


Global Concept: Group
Set of users
Same needs/privileges
Pluggable group mapping
Using Hadoop Groups
OS, LDAP, Active Directory
©2014 Cloudera, Inc. All rights reserved.


Global Concept: Privilege
Unit of data access
Tuple
Object
Action
Always positive
READ TABLE logs
READ DATABASE prod
WRITE and READ TABLE logs
QUERY COLLECTION logs
UPDATE COLLECTION admin
©2014 Cloudera, Inc. All rights reserved.






Global Concept: Role
Set of privileges
Functional template
Unit of grant
Analyst
Analyst Junior
Warehouse admin
Warehouse user
Project X
©2014 Cloudera, Inc. All rights reserved.






Global Concepts: Relations
Groups have multiple users
Roles have multiple privileges
Roles are assigned to groups
Sentry does not support direct grants to users
No jumping
User to role, group to privilege, …
User → Group → Role → Privilege (see the sketch below)
©2014 Cloudera, Inc. All rights reserved.
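A minimal sketch of this relation chain as plain Python data; the group, role, and privilege names are illustrative.

```python
# Sentry's chain: users belong to groups, groups are granted roles, and
# roles hold privileges. There is no direct user -> privilege grant.
GROUPS = {"analyst": {"alice", "bob"}}                      # group -> users
ROLES = {"analyst": {"analyst_role"}}                       # group -> roles
PRIVILEGES = {"analyst_role": {("SELECT", "db=analyst1")}}  # role -> privileges

def privileges_for(user):
    """Walk user -> group -> role -> privilege; no jumping allowed."""
    grants = set()
    for group, members in GROUPS.items():
        if user in members:
            for role in ROLES.get(group, ()):
                grants |= PRIVILEGES.get(role, set())
    return grants

print(privileges_for("alice"))  # {('SELECT', 'db=analyst1')}
```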





Agenda
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration
Future plans
Demo
Questions

Sentry features – fine-grained authorization
Privileges at various levels of the resource hierarchy
E.g., Database, Table and Column for the SQL model
Read or Select access on a Database implicitly grants access on child tables
Supports different actions on resources
E.g., Select, Insert, Create, Alter in the SQL model
Query, Update in the Search model
…

Sentry Features – role based
Supports roles as collections of permissions
A template for functional access rules
E.g., an Analyst role: read table sales, read table customer, admin of sandbox
Makes authorization administration manageable in large and complex deployments
Allows granting roles to groups
A role can be granted to a large set of users in a single operation
Easier integration with existing identity management systems like AD
Onboarding and removing users is a lot simpler with roles and groups

Sentry features – misc
Multi-Tenant administration
Ability to delegate admin access for a subset of resources
e.g. a user can be an admin of his/her own sandbox database
Pluggable architecture
A new authorization model can be implemented with few code changes
Can easily integrate with new identity management systems for groups
Supports various callbacks for custom monitoring

Agenda
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration
Future plans
Demo
Questions

Apache Sentry conceptual overview
Diagram: data engines (HDFS NameNode, Solr search, HiveServer2/Hive, Impala, Sqoop2) each host a binding layer that calls a common policy engine, which reads privileges from a policy provider backed by a file or a database.

Apache Sentry conceptual overview
Policy Provider
Abstraction for loading and manipulating privilege metadata
Supports external DB-backed storage (default)
Also supports local or HDFS file storage (deprecated)
Policy Engine
Makes the authorization decision
Reads the metadata from policy provider
Binding
Bridging layer between the downstream service and Sentry
Handles translating the native access request into Sentry APIs

Sentry Service Architecture
Diagram: a data engine (e.g. Hive) embeds a Sentry plugin, which talks to the Sentry RPC server; the server reads and writes the policy metadata store.

Sentry Service
RPC Service to manage metadata
Apache Thrift RPC implementation
Java client
Secured with Kerberos
API to retrieve and manipulate policies
Metadata stored in external backend DB
Supports Derby, MySQL, Postgres, Oracle and DB2

Sentry Service HA
Diagram: multiple Sentry Service instances, each with its own policy store config, register with ZooKeeper; a client looks up a service instance via ZooKeeper, and all instances share one backend DB.

Sentry Service HA
Active/Active HA
Each service registers with ZK
Client first retrieves a service address from ZK
Uses the Apache Curator framework

File-based privilege metadata
Policy information can be stored in local or HDFS files
Deprecated in newer releases in favor of DB-based policies
INI-format property file, e.g.:
# group to role mapping
[groups]
manager = analyst_role, junior_analyst_role
analyst = analyst_role
admin = admin_role
# role to privilege mapping
[roles]
analyst_role = server=server1->db=analyst1, \
server=server1->db=jranalyst1->table=*->action=select, \
server=server1->db=default->table=tab2

Sentry Client Plugin
Client side piece of Sentry
Integrates via the authorization interfaces
Responsible for authorization decision
Receives requested resources and user from caller
Retrieves relevant privileges from Sentry service
Evaluates the request

Auditing
Sentry service generates audit trails
Policy changes are audited
e.g. granting privileges, creating/dropping roles
Audit log in JSON format
Easier to process with audit reporting tools
Client-side auditing is handled by the client's own auditing mechanism
e.g. Hive and Impala
Sentry supports client callbacks which can be used for customization

Agenda
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration
Future plans
Demo
Questions

Integration with Hadoop Ecosystem
Diagram: Sentry plugins, each with a privilege cache, are embedded in the Impala catalog service and daemons, HiveServer2, the Hive Metastore, the HDFS NameNode, Sqoop2 and an admin app; all consume shared policy metadata, Hadoop group mapping and authentication, and emit an audit trail.
Unified authorization for Hadoop ecosystem
Single source of truth
Other projects don’t have to implement their own auth
Same set of roles and groups available across tools
Makes authorization administration a lot simpler
Same privileges enforced irrespective of the access path

SQL on Hadoop
Diagram: HiveServer2 (executing via MR/YARN) and Impala share data on HDFS owned by Hive, schema metadata in the Hive Metastore, and authorization policies in Sentry.

Sentry with Apache Hive
Diagram: the HiveServer2 pipeline (parse SQL, build plan, check, run MR query) includes a Sentry check that validates access to SQL entities before the query executes.

Sentry with Apache Hive
Requires HiveServer2; not supported with the thick Hive client
SQL model with fine-grained authorization
DB objects - Database, Table and Column
DB actions - SELECT, INSERT, CREATE, ALTER, ..
Supports managed and external tables
Special handling of external path specification via URI level privilege
Authorization administration via SQL
grant, revoke, create/drop role etc. (see the sketch below)
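Complementing the earlier sketch, a minimal sketch of the full administration cycle (role, group and table names illustrative):
CREATE ROLE junior_analyst_role;
GRANT SELECT ON TABLE customer TO ROLE junior_analyst_role;
GRANT ROLE junior_analyst_role TO GROUP junior_analyst;
REVOKE SELECT ON TABLE customer FROM ROLE junior_analyst_role;
DROP ROLE junior_analyst_role;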

Sentry with Impala
Diagram: the Impala catalog service and each Impala daemon host a Sentry plugin with a local privilege cache; access to SQL entities is validated against the cached privileges before a query executes.
Sentry with Impala
Uses the same SQL model as Hive
Fine-grained authorization
DB objects - Database, Table, Column
DB actions - SELECT, INSERT, CREATE, ALTER, ..
Supports managed and external tables
Special handling of external path specification via URI-level privilege
Impala engine caches privilege metadata for faster access

View-level privileges for SQL authorization
Views are essentially queries defined on one or more tables
Eg CREATE VIEW v1 AS SELECT tab1.col1, tab2.col2 FROM tab1, tab2 …
Privileges on views are independent of the base tables
This enables row/cell level privileges
Requires data files to be owned and accessed by the Hive user
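A minimal sketch of row-level filtering through a view (table, column and role names illustrative):
CREATE VIEW us_sales AS SELECT * FROM sales WHERE country = 'US';
GRANT SELECT ON TABLE us_sales TO ROLE us_analyst_role;
Members of the group mapped to us_analyst_role can query us_sales without holding any privilege on the underlying sales table.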

URI level privilege
Hive SQL supports file URIs, which can lead to security loopholes
Alternate storage path for tables
Create table
Alter table
External table
One can specify the path of a different table and bypass authorization
ALTER TABLE sandbox.sales SET LOCATION '/user/hive/warehouse/production/sales'
Hive UDFs using jars with untrusted/unauthorized static code
A URI resource privilege can be used to prevent this
A file URI can only be used if you have an explicit grant on it (see the sketch below)
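A minimal sketch of such a grant (path and role name illustrative; Sentry expects a fully qualified URI):
GRANT ALL ON URI 'hdfs://namenode:8020/data/landing' TO ROLE etl_role;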

Sentry with Metastore
Diagram: Metastore RPC clients (Pig/HCat, MR, YARN, Spark jobs) read and write metadata directly; a Sentry plugin inside the Hive Metastore enforces the same privileges on that metadata.
Sentry with Metastore
Enforces the same policies for metadata access
Prevents unauthorized schema changes
Hides metadata from unauthorized users
Works for all Metastore RPC clients
Apache Pig with HCatalog
Hadoop jobs
Third party applications

Sentry HDFS ACL sync
Diagram: a Sentry plugin in the NameNode (with a privilege cache) applies Sentry privileges as HDFS ACLs on the files/directories that belong to Hive tables, so that non-SQL clients (Pig/HCat, MR, YARN, Spark, BI/analytics apps) get the same access control.
HDFS ACL sync for non-SQL clients
Apply Sentry privileges as HDFS ACLs
Requires HDFS extended ACLs to be enabled
The NameNode maintains a cache of privileges
Currently supported for Hive data only
Enables the same granularity of access to files for non-SQL clients
The Hadoop-side changes were committed recently and are only available in trunk

Sentry with Apache Solr
Diagram: HTTP clients query Solr; a Sentry plugin validates access to Solr collections and documents before queries reach the indexes.
Sentry with Apache Solr
Fine grained authorization
Collection
Documents
Index
Supports query and update access on these resources

Sentry with Apache Sqoop
Authorization of various Sqoop resources
connectors, links, jobs
Fine-grained authorization of actions
Create, Enable, Start/Stop, List, etc.
Under development
SENTRY-612 being reviewed

Agenda
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration
Future plans
Demo
Questions

Sentry Administration
Privileges managed natively by the downstream app
Authorization SQL statements
Application APIs
Hue UI
Sentry App for policy administration
Pluggable group mapping
By default the same as Hadoop's (OS or LDAP/AD)

Sentry App in Hue




Setting up Sentry in a Hadoop cluster
Should have strong authentication like Kerberos or LDAP
Set up the Sentry service
Set up the metadata DB
Configure and run the service
Set up data services to use Sentry
Configure auth plugins
Set up the Sentry client configuration to use the Sentry service
Create roles and privileges
The Hue UI app is very useful here

KEY POINTS
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration




Please Send Your Queries on:
e-Mail: jaspreet.e10279@cumail.in

Department of Computer Science & Engineering
BIG DATA SECURITY
(CSC-482)
DR. JASPREET SINGH BATTH
E10279
ASSISTANT PROFESSOR
CSE (AIT), CU
DISCOVER . LEARN . EMPOWER


About COURSE
To understand Hadoop Components and HDFS.
To learn about inherent security issues with HDFS.
To learn how Hadoop deals with inherent security issues.
To define Hadoop’s Operational Security Woes.
To learn about Authentication and Authorization.
To learn how to harness Fine grained authorization.
To learn about Hadoop loggings for Security.
To do case study about Ganglia and Nagios.
To know about Encryption of data at rest and in transit.
To study open source authentication in Hadoop
To learn about PuTTY’s Host Keys
To understand key based authentication using PuTTY

COURSE OBJECTIVES
CO Number | Title | Level |
CO1 | Describe how the security for Big Data | Understand & |
CO2 | To evaluate the basics of Big Data Security and its case study for Big Data Applications. (Analytics Flow for Big Data) | Understand & |
CO3 | Hadoop Logging, Encryption of data in | Apply |

COURSE OUTCOMES
To understand why Authentication and Authorization are
required.
To learn how to analyze security issues with HDFS.
To learn how to securely administer HDFS

Agenda
Why Hive???
What is Hive?
Hive Data Model
Hive Architecture
HiveQL
Hive SerDe’s
Pros and Cons
Hive v/s Pig

… Enter Hive!


Hive Key Principles


HiveQL to MapReduce
Diagram: a data analyst submits SELECT COUNT(1) FROM Sales; the Hive framework compiles it into an MR job over the Sales Hive table, mappers emit (rowcount, 1) pairs, and the reducer sums them into (rowcount, N).
Hive Data Model
Data in Hive is organized into:
Tables
Partitions
Buckets

Hive Data Model Contd.
Tables
Analogous to relational tables
Each table has a corresponding directory in HDFS
Data serialized and stored as files within that directory
Hive has default serialization built in which supports compression and
lazy deserialization
Users can specify custom serialization/deserialization schemes
(SerDes)

Hive Data Model Contd.
Partitions
Each table can be broken into partitions
Partitions determine distribution of data within subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
So each partition will be split out into different folders like
Sales/country=US/year=2012/month=12
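A minimal sketch of writing into one such partition (staging table and values illustrative):
INSERT INTO TABLE Sales PARTITION (country='US', year=2012, month=12)
SELECT sale_id, amount FROM staging_sales;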

Hierarchy of Hive Partitions
/hivebase/Sales
    /country=CANADA
        /year=2012
            /month=11
    /country=US
        /year=2012
            /month=11
            /month=12
        /year=2014
        /year=2015
(the leaf month directories contain the data files)
Hive Data Model Contd.
Buckets
Data in each partition divided into buckets
Based on a hash function of the column
H(column) mod NumBuckets = bucket number
Each bucket is stored as a file in the partition directory (see the sketch below)
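A minimal sketch of declaring buckets at table creation (column and bucket count illustrative):
CREATE TABLE Sales_bucketed (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;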

Architecture
External Interfaces - CLI, Web UI, JDBC,
ODBC programming interfaces
Thrift Server - cross-language service
framework
Metastore - metadata about the Hive
tables, partitions
Driver - the brain of Hive! Compiler, optimizer
and execution engine


Hive Thrift Server
Framework for cross language services
Server written in Java
Support for clients written in different languages
- JDBC (Java), ODBC (C++), PHP, Perl, Python scripts


Metastore
System catalog which contains metadata about the Hive tables
Stored in an RDBMS or the local FS; HDFS is too slow (not optimized for random access)
Objects of Metastore
Database - namespace of tables
Table - list of columns, types, owner, storage, SerDes
Partition - partition-specific columns, SerDes and storage


Hive Driver
Driver - maintains the lifecycle of a HiveQL statement
Query Compiler - compiles HiveQL into a DAG of MapReduce tasks
Executor - executes the task plan generated by the compiler in proper
dependency order; interacts with the underlying Hadoop instance


Compiler
Converts the HiveQL into a plan for execution
Plans can be:
Metadata operations for DDL statements, e.g. CREATE
HDFS operations, e.g. LOAD
Semantic Analyzer - checks schema information, type checking,
implicit type conversion, column verification
Optimizer - finds the best logical plan, e.g. combines multiple joins
to reduce the number of MapReduce jobs, prunes columns
early to minimize data transfer
Physical plan generator - creates the DAG of MapReduce jobs

HiveQL
DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE
DML:
LOAD DATA
INSERT
QUERY:
SELECT
GROUP BY
JOIN
MULTI TABLE INSERT
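A brief sketch combining these statements (table and path names illustrative):
CREATE TABLE sales (sale_id INT, amount FLOAT, country STRING);
LOAD DATA INPATH '/data/sales.csv' INTO TABLE sales;
SELECT country, SUM(amount) FROM sales GROUP BY country;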

Hive SerDe
Diagram: for a SELECT query, the RecordReader reads records from the Hive table, the SerDe deserializes them into Hive row objects, and the ObjectInspector maps the fields for the end user. Hive's built-in SerDes include Avro, ORC, Regex etc.; custom SerDes can be used e.g. for unstructured data like audio/video and semi-structured XML data.
Good Things
Boon for Data Analysts
Easy Learning curve
Completely transparent to underlying Map-Reduce
Partitions (speed!)
Flexibility to load data from localFS/HDFS into Hive Tables

Cons and Possible Improvements
Extending the SQL query support (updates, deletes)
Parallelize firing independent jobs from the work DAG
Table statistics in the Metastore
Explore methods for multi-query optimization
Perform N-way generic joins in a single MapReduce job
Better debug support in the shell

Hive v/s Pig
Similarities:
Both are high-level languages that work on top of the MapReduce framework
Can coexist, since both use the underlying HDFS and MapReduce
Differences:
Language
Pig is procedural (A = load 'mydata'; dump A)
Hive is declarative (select * from A)
Work Type
Pig is more suited for ad hoc analysis (on-demand analysis of clickstream
search logs)
Hive is more suited for reporting (e.g. weekly BI reporting)



Hive v/s Pig
Differences:
Users
Pig - researchers, programmers (building complex data pipelines,
machine learning)
Hive - business analysts
Integration
Pig - doesn't have a Thrift server (i.e. no/limited cross-language support)
Hive - has a Thrift server
User's need
Pig - better dev environments and debuggers expected
Hive - better integration with technologies expected (e.g. JDBC, ODBC)



Head-to-Head
(the bee, the pig, the elephant)
Version: Hadoop – 0.18x, Pig:786346, Hive:786346


KEY POINTS
Why Hive???
What is Hive?
Hive Data Model
Hive Architecture
HiveQL
Hive SerDe’s
Pros and Cons
Hive v/s Pig








CONTENTS TO BE COVERED
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration

Impala


Impala: Goals
General-purpose SQL query engine for Hadoop
High performance
C++ implementation
runtime code generation (using LLVM)
direct data access (no MapReduce jobs)
Run directly on Hadoop
read the same file formats
use the same storage managers (Hive metastore)
daemons on the same nodes that run Hadoop processes

Data formats
Supported HDFS file formats
Parquet
Text
Avro*
RCFile*
SequenceFile*
* no inserts, use Hive for that
Querying HBase tables possible
Querying the Amazon S3 filesystem is in a test phase

User interfaces
impala-shell for interactive commands
Apache Hue as web-based user interface
JDBC and ODBC to connect from applications
or as an external database from Oracle

Components
impala daemon (impalad)
one per node
accepts queries, distributes work, transfers results back to the coordinator
node
impala statestore (statestored)
one per cluster
monitors the health of impala daemons
impala catalog service (catalogd)
one per cluster
transfers metadata changes from Impala SQL statements

Query execution
Diagram: applications (via ODBC) and impala-shell send SQL to an impalad; each node runs a query planner, query coordinator and query executor over local HDFS data, table metadata comes from the Hive metastore, and results flow back to the client.


Impala metadata and Hive metastore
table definitions live in the shared Hive metastore
impala tracks additional metadata, including:
physical location of blocks in HDFS
after external changes (through Hive, or manually to files) the metadata
needs to be updated:
REFRESH table_name, INVALIDATE METADATA
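For example (table name illustrative):
REFRESH sales;             -- pick up new data files added to an existing table
INVALIDATE METADATA sales; -- pick up a table created or altered outside Impala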

Hands on: Impala & Hive metastore
Create table in Impala
check if it’s accessible in Hive
check content of default Hive folder
try inserting
Vice versa. Create table in Hive
check if it’s accessible in Impala
try inserting
commands: http://cern.ch/kacper/impala1.txt

Query optimizer
Commands available for performance tuning
EXPLAIN SELECT … - shows the steps that a query will perform
SUMMARY - report about the last executed query
PROFILE - like SUMMARY, but with more detailed and
low-level information
Table statistics are stored in Metastore
can be viewed using
SHOW TABLE STATS table_name
SHOW COLUMN STATS table_name
if missing, use
COMPUTE STATS table_name
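A short illustrative session (table name illustrative):
COMPUTE STATS sales;
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;
EXPLAIN SELECT country, SUM(amount) FROM sales GROUP BY country;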

Introduction to Apache Solr


Agenda
Overview ✸
Brief History
Why Solr?
Building Blocks of Solr
Solr Schema Hierarchy
Installation
Solr Home Directory
Common Query Parameters
Result Grouping
Field-Value Faceting
Demo


Overview
Apache Solr is a popular, open source enterprise
search platform built on the Java based search
engine library Apache Lucene.
Solr powers the search and navigation features of
many of the world's largest companies like Netflix,
Instagram, LinkedIn, Twitter, eBay, etc.


Agenda
Overview ✓
Brief History ✸
Why Solr?
Building Blocks of Solr
Solr Schema Hierarchy
Installation
Solr Home Directory
Common Query Parameters
Result Grouping
Field-Value Faceting
Demo


Brief History
2004: Solr was created by Yonik Seeley at CNET Networks (In house)
2006: CNET Networks donated it as an open source project to the Apache Software
Foundation
2008: Solr 1.3 was released including distributed search capabilities
2009: Solr 1.4 with enhancements in indexing, searching and faceting
2010-11: Apache projects Lucene and Solr were merged. And to keep both on same
version number, the next release after Solr 1.4 was labeled as Solr 3.1
2012: Released Solr 4.0 with the new SolrCloud feature.
2015: Released Solr 5.0 where Solr was packaged as a standalone application - No
more need to deploy as war.
2016: Release Solr 6.0 with support for Parallel SQL queries
Current stable release: 6.6.0 / June 6, 2017


Agenda
Overview ✓
Brief History ✓
Why Solr? ✸
Building Blocks of Solr
Solr Schema Hierarchy
Installation
Solr Home Directory
Common Query Parameters
Result Grouping
Field-Value Faceting
Demo


Why Solr?
Uses the Lucene library for advanced full-text
search
XML, JSON and HTTP support
Comprehensive Admin UI
Distributed Search through Sharding - enables
scaling content volume
Easy configuration
Queries, filters, and documents
And a lot more...


Agenda
Overview ✓
Brief History ✓
Why Solr? ✓
Building Blocks of Solr ✸
Solr Schema Hierarchy
Installation
Solr Home Directory
Common Query Parameters
Result Grouping
Field-Value Faceting
Demo


Building Blocks of Solr
Request Handler
– e.g. /select, /update
Search Component
– e.g. query, faceting, grouping
Query Parser
verifies the queries for syntactical errors
translates them to a format which Lucene understands.
Response Writer
– e.g. XML, JSON, CSV
Analyzer/Tokenizer
Analyzer: examines the text of fields and generates a token stream.
Tokenizer: breaks the field text into tokens.
Update Request Processor
– responsible for modifications such as dropping a field, adding a field


Agenda
Overview ✓
Brief History ✓
Why Solr? ✓
Building Blocks of Solr ✓
Solr Schema Hierarchy ✸
Installation
Solr Home Directory
Common Query Parameters
Result Grouping
Field-Value Faceting
Demo


Solr Schema Hierarchy



Agenda
Overview ✓
Brief History ✓
Why Solr? ✓
Building Blocks of Solr ✓
Solr Schema Hierarchy ✓
Installation ✸
Solr Home Directory
Common Query Parameters
Result Grouping
Field-Value Faceting
Demo


Installation
Requires Java 8+ (a recent JVM)
Download and extract Apache Solr 6.6.0
Go to the extracted directory in a terminal and start Solr
using
$SOLR_HOME/bin/solr start
OR
$SOLR_HOME/bin/solr -e techproducts


Agenda
Overview ✓
Brief History ✓
Why Solr? ✓
Building Blocks of Solr ✓
Solr Schema Hierarchy ✓
Installation ✓
Solr Home Directory ✸
Common Query Parameters
Result Grouping
Field-Value Faceting
Demo


Solr Home Directory
/conf
solrconfig.xml
schema.xml
/data
Data store for the Solr index
/lib (optional)
Used to load external jars for resolving any
plugins specified in solrconfig.xml or schema.xml


Agenda
Overview ✓
Brief History ✓
Why Solr? ✓
Building Blocks of Solr ✓
Solr Schema Hierarchy ✓
Installation ✓
Solr Home Directory ✓
Common Query Parameters ✸
Result Grouping
Field-Value Faceting
Demo


Common Query Parameters
The table below summarizes Solr's common
query parameters, which are supported by the
Search RequestHandlers
Parameter | Description
wt | Response writer (JSON/XML/CSV)
q | Main query (mandatory; *:* if no main query is specified)
fq | Filter query (applied over the output of the main query)
fl | List of fields to output
start | Offset (default 0)
rows | Number of rows after offset (default 10)
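An illustrative request combining these parameters against the techproducts example started earlier (host and port assume a default local install):
http://localhost:8983/solr/techproducts/select?q=*:*&fq=inStock:true&fl=id,name,price&start=0&rows=5&wt=json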


Agenda
Overview ✓
Brief History ✓
Why Solr? ✓
Building Blocks of Solr ✓
Solr Schema Hierarchy ✓
Installation ✓
Solr Home Directory ✓
Common Query Parameters ✓
Result Grouping ✸
Field-Value Faceting
Demo


Result Grouping
Result Grouping groups documents with a
common field value into groups and returns
the top n documents for each group.
Parameter | Description
group | If true, returns a grouped response
group.field | Field which is the grouping criterion
start | Initial offset (default 0)
rows | Number of groups to return (default 10)
group.limit | Number of results (n) in each group (default 1; to get all documents, set this value to -1)
group.ngroups | If true, also returns the total number of groups that matched the query (default false)
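An illustrative grouping request (the field name comes from the techproducts example schema):
http://localhost:8983/solr/techproducts/select?q=*:*&group=true&group.field=manu_id_s&group.limit=3&group.ngroups=true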


Agenda
Overview ✓
Brief History ✓
Why Solr? ✓
Building Blocks of Solr ✓
Solr Schema Hierarchy ✓
Installation ✓
Solr Home Directory ✓
Common Query Parameters ✓
Result Grouping ✓
Field-Value Faceting ✸
Demo


Field-Value Faceting
A facet is one side or aspect of something.
Faceted search allows users to explore the
multiple faces of a field.
Several parameters can be used to trigger
faceting based on the indexed terms in a field.
Parameter | Description
facet | If true, enables faceting
facet.field | Identifies the faceting field
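An illustrative faceting request (field name from the techproducts example; rows=0 returns only the facet counts):
http://localhost:8983/solr/techproducts/select?q=*:*&rows=0&facet=true&facet.field=cat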


KEY POINTS
Various aspects of data security
Apache Sentry for authorization
Key concepts of Apache Sentry
Sentry features
Sentry architecture
Integration with Hadoop ecosystem
Sentry administration








CONTENTS TO BE COVERED
Introduction to Audit Logging
Security monitoring
Understanding loggers and appenders using Log4j API
Hadoop audit logs and daemon logs
More on Hadoop audit logs and daemon logs

Security Audit
"The world isn’t run by weapons anymore, or
energy, or money. It’s run by little ones and zeros,
little bits of data... There’s a war out there... and
it’s not about who’s got the most bullets. It’s
about who controls the information.“
Federation of American Scientists - Intelligence Resource Program

Workshop Outline (2)
Security Audit

FAQ
We already have firewalls in place. Isn't that enough?
We did not realize we could get security audits. Can you really get
security audits, just like financial audits?
We have already had a security audit. Why do we need another
one?

Answers
Firewalls and other devices are simply tools to
help provide security. They do not, by themselves,
provide security. Using a castle as an analogy,
think of firewalls and other such tools as simply
the walls and watch towers. Without guards,
reports, and policies and procedures in place, they
provide little protection.
Security audits, like financial audits should be
performed on a regular basis.

Security Audit-Definitions
A security audit is a policy-based assessment of
the procedures and practices of a site, assessing
the level of risk created by these actions
An assessment process which develops systems
and procedures within an organization, creates
awareness amongst the employees and users, and
ensures compliance with legislation through
periodic checking of processes, constituents and
documentation.

Why Audit?
Determine Vulnerable Areas
Obtain Specific Security Information
Allow for Remediation
Check for Compliance
Ensure Ongoing Security
To ensure that the site’s
networks and systems are
efficient and foolproof

Who needs security auditing?
A security audit is necessary for every organization using
the Internet.
An ongoing process that must be exercised and improved to cope
with ever-changing and challenging threats.
Being audited should not be feared; an audit is good
practice.

Audit Phases
External Audit
Public information collection
External Penetration
Non-destructive test
Destructive test
Internal Audit
Confidential information collection
Security policy reviewing
Interviews
Environment and Physical Security
Internal Penetration
Change Management
Reporting

Audit Phases-External
Hacker's view of the network
Simulate attacks from outside
Point-in-time snapshots
Can NEVER be 100%

External Audit-Public Information Gathering
Search for information about the target and its
critical services provided on the Internet.
Network Identification
Identify IP addresses range owned/used
Network Fingerprinting
Try to map the network topology
Perimeter models identifications
OS & Application fingerprinting
OS fingerprinting
Port scanning to define services and application
Banner grabbing

External Audit - Some Commandments
Do not make ANY changes to the systems or
networks
Do not impact processing capabilities by running
scanning/ testing tools during business hours or
during peak or critical periods
Always get permission before testing
Be confidential and trustworthy
Do not perform unnecessary attacks

External Audit-Penetration Test
Plan the penetration process
Search for vulnerabilities for information gathered and obtain the
exploits
Conduct vulnerability assessments (ISO 17799)
Non-destructive test
Scans / test to confirm vulnerabilities
Make SURE not harmful
Destructive test
Only for short term effect (DDOS….)
Done from various locations
Done only off-peak hours to confirm effect
Record everything
Save snapshots and record everything for every test done, even if it
returned a false result
Watch out for HONEYPOTS

Internal Audit
Conducted at the premises
A process of hacking with full knowledge of the
network topology and other crucial information.
Also to identify threats within the organization
Should be 100% accurate.
Must be cross checked with external penetration
report.

Internal Audit-Policy review
Policy
Standards
Procedures, Guidelines
& Practices
Everything starts
with the security
policy
If there is no
policy, there is
no need for a
security audit.

Internal Audit-Policy review
Policies are studied properly and classified
Identify any security risk exist within the policy
Interview IT staff to gain a proper understanding of
the policies
Also to identify the level of implementation of the
policies.

Internal Audit-Information gathering
Discussion of the network topology
Placement of perimeter devices of routers and
firewalls
Placement of mission critical servers
Existence of IDS
Logging



Internal Audit-Environment &
Physical Security
Locked / combination / card swipe doors
Temperature / humidity controls
Neat and orderly computing rooms
Sensitive data or papers lying around?
Fire suppression equipment
UPS (Uninterruptible power supply)
Section 8.1 of the ISO 17799 document defines
the concepts of secure area, secure perimeter
and controlled access to such areas.



Internal Audit-Penetration
The internal penetration test can be divided into a few
categories
Network
Perimeter devices
Servers and OS
Application and services
Monitor and response
Find vulnerabilities and malpractice in each category



Internal Audit-Network
Location of devices on the network
Redundancy and backup devices
Staging network
Management network
Monitoring network
Other network segmentation
Cabling practices
Remote access to the network



Internal Audit-Perimeter Devices
Check configuration of perimeter devices like
Routers
Firewalls
Wireless AP/Bridge
RAS servers
VPN servers
Test the ACL and filters like egress and ingress
Firewall rules
Configuration Access method
Logging methods



Internal Audit-Server & OS
Identify mission-critical servers like DNS, email and others
Examine the OS and patch levels
Examine the ACLs on each server
Examine the management controls - accounts & passwords
Placement of the servers
Backup and redundancy



Internal Audit-Application & Services
Identify services and applications running on the mission-critical
servers. Check vulnerabilities for the versions
running. Remove unnecessary services/applications
DNS
Name services (BIND)
POP3, SMTP
Web/HTTP
SQL
Others



Internal Audit-Monitor & Response
Check for procedures on
Event Logging and Audit
What is logged?
How frequently are logs viewed?
How long are logs kept?
Network monitoring
What is monitored?
Response alerts?
Intrusion Detection
Is an IDS in place?
What rules and detection are used?
Incident Response
How is the response to an attack?
What is the recovery plan?
Follow-up?



Internal Audit-Analysis and Report
Analyze results
Check compliance with the security policy
Identify weaknesses and vulnerabilities
Cross-check with the external audit report
Report - the key to realizing value
Must have two parts:
Non-technical (for management use)
Technical (for IT staff)
Methodology of the entire audit process
Separate internal and external findings
State weaknesses/vulnerabilities
Suggest solutions to harden security

Tools


More Tools….
Inetmon
Firewalk
Dsniff
RafaleX
NetStumbler
RAT (Router Audit Tool)-CIS
Retina scan tools
MBSA

KEY POINTS
Introduction to Audit Logging
Security monitoring
Understanding loggers and appenders using Log4j API
Hadoop audit logs and daemon logs
More on Hadoop audit logs and daemon logs









Security Audit Terminology
An independent review and examination of a system's records and
activities
To determine the adequacy of system controls
To ensure compliance with established security policy and procedures, detect
breaches in security services,
To recommend any changes that are indicated for countermeasures
Objectives: to establish accountability for system entities that initiate
or participate in security-relevant events and actions

Security Audit Trail
A chronological record of system activities that is sufficient to
enable the reconstruction and examination of the sequence of
environments and activities surrounding or leading to an
operation, procedure, or event in a security-relevant transaction
from inception to final results

Security Audit Architecture


Distributed Audit Trail Model


Security Auditing Functions


Security Audit Functions
Data generation: Identifies the level of auditing, enumerates the
types of auditable events
Event selection: Inclusion or exclusion of events from the
auditable set
Event storage: Creation and maintenance of the secure audit trail
Automatic response: reactions taken if a possible security
violation event is detected
Audit analysis: automated mechanisms to analyze audit data in
search of security violations
Audit review: available to authorized users to assist in audit data
review

Event Definition: Requirement
Must define what are auditable events
Common Criteria suggests:
introduction of objects
deletion of objects
distribution or revocation of access rights or capabilities
changes to subject or object security attributes
policy checks performed by the security software
use of access rights to bypass a policy check
use of identification and authentication functions
security-related actions taken by an operator/user
import/export of data from/to removable media

Other Audit Requirements
Event detection hooks in software and monitoring software to
capture activity
Event recording function with secure storage
Event and audit trail analysis software, tools, and interfaces
Security of the auditing function: data but also software and storage
must be protected
Minimal effect on functionality

Implementation Requirements (ISO)
agree on requirements management
scope of checks agreed and controlled
checks limited to read-only access to s/w & data
other access only for isolated copies of system files, then erased
or given appropriate protection
resources for performing the checks should be explicitly identified
and made available
identify/agreed on special processing requirements
all access should be monitored and logged
document procedures, requirements, responsibilities
person(s) doing audit independent of activities

What to Collect
Data items captured may include:
auditing software use
use of system security mechanisms
events from IDS and firewall systems
system management/operation events
operating system access (system calls)
access to selected applications
remote access
A common concern: the amount of data generated

Auditable Items Suggested in X.816

Examples of System-Level Audit Trails
Useful to categorize audit trails
System-level audit trails:
Captures logins, device use, O/S functions, e.g.
Jan 27 17:18:38 host1 login: ROOT LOGIN console
Jan 27 17:19:37 host1 reboot: rebooted by root
Jan 28 09:46:53 host1 su: 'su root' succeeded for user1 on /dev/ttyp0
Jan 28 09:47:35 host1 shutdown: reboot by user1

Example of Application-Level Audit Trails
To detect security violations within an application
To detect flaws in application's system interaction
For critical/sensitive applications, e.g. email, DB
email: sender, receiver, email size
database: queries, table insertion and removal
Record appropriate security related details, e.g.
Apr 9 11:20:22 host1 AA06370: from=<user2@host2>, size=3355, class=0
Apr 9 11:20:23 host1 AA06370: to=<user1@host1>, delay=00:00:02, stat=Sent
Apr 9 11:59:51 host1 AA06436: from=<user4@host3>, size=1424, class=0
Apr 9 11:59:52 host1 AA06436: to=<user1@host1>, delay=00:00:02, stat=Sent

User-Level Audit Trails
Trace activity of individual users over time
to hold user accountable for actions taken
as input to an analysis program that attempts to define normal versus
anomalous behavior
May capture
user interactions with system
e.g. commands issued
identification and authentication attempts
files and resources accessed
may also log use of applications

Physical-Level Audit Trails
Generated by physical access controls
e.g. card-key systems, alarm systems
Sent to central host for analysis/storage
Can log
date/time/location/user of access attempt
both valid and invalid access attempts
attempts to change access privileges
may send violation messages to personnel

Audit Trail Storage Alternatives
Read/write file on host
easy, least resource use, fast access
vulnerable to attack by intruder
Write-once device (e.g. CD/DVD-ROM)
more secure but less convenient
need media supply and have delayed access
Write-only device (e.g. printer)
paper-trail but impractical for analysis
Must protect both integrity and confidentiality
(e.g. an attacker could otherwise change a pay level or rank)
using encryption, digital signatures, access controls

Implementing Logging
Foundation of security auditing facility is the initial capture of the
audit data
Software must include hooks (capture points) that trigger data
collection and storage as preselected events occur
Operating system/application dependent

Windows Event Log
Each event an entity that describes some interesting occurrence and
each event record contains: numeric id, set of attributes, optional user data
Presented as XML or binary data
Three types of event logs:
system: system related apps & drivers
application: user-level apps
security: for Local Sec Authority (LSA) only

Windows Event Log Example
Event Type: Success Audit
Event Source: Security
Event Category: (1)
Event ID: 517
Date: 3/6/2006
Time: 2:56:40 PM
User: NT AUTHORITY\SYSTEM
Computer: KENT
Description: The audit log was cleared
Primary User Name: SYSTEM
Primary Domain: NT AUTHORITY
Primary Logon ID: (0x0,0x3F7)
Client User Name: userk
Client Domain: KENT
Client Logon ID: (0x0,0x28BFD)

Windows Event Categories
Account logon events: acceptance/rejection of authentication
Account management: account creation/deletion
Directory service access: user access to active dir (that has a
system access control defined)
Logon events: user log in/log off, bad password
Object access: like directory service access, but for registry and similar objects
Policy changes: admin changes to access policies
Privilege use: user right changes
Process tracking: start and termination
System events: start, reboot, shut down

KEY POINTS
Introduction to Audit Logging
Security monitoring
Understanding loggers and appenders using Log4j API
Hadoop audit logs and daemon logs
More on Hadoop audit logs and daemon logs









Log4J Framework


Agenda
Logging Brief
Logging Pros & Cons
Log4j Framework
Target Audience
Installation
Log4J Framework Architecture
Logging Levels
Filters
Log4J Configuration Files
Demo
Questions

Logging Brief
Logging is the act of recording events taking
place in the execution of a system in order to
provide an audit trail that can be used to
understand the activity of the system and to
diagnose problems.
Check below link for further details:
Log4j Tutorial

Pros & Cons


Log4j Framework


Target Audience


Installation


Architecture


Architecture –cont.


Architecture – Support Objects


Log4j Configuration files


Log4J Configuration Cont.
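The configuration slides above were image-based; as a stand-in, here is a minimal log4j.properties sketch (logger and appender names are illustrative) wiring a root logger to a console appender:
# Root logger: INFO level, writing to the 'console' appender
log4j.rootLogger=INFO, console
# Console appender with a pattern layout
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c{1} - %m%n
# More verbose logging for one audit-related package
log4j.logger.org.example.audit=DEBUG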


KEY POINTS
Introduction to Audit Logging
Security monitoring
Understanding loggers and appenders using Log4j API
Hadoop audit logs and daemon logs
More on Hadoop audit logs and daemon logs




Hadoop Security
Ben Spivey & Joey Echeverria
Hadoop Security
by Ben Spivey and Joey Echeverria
Copyright © 2015 Joseph Echeverria and Benjamin Spivey. All rights
reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://safaribooksonline.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.
Editors: Ann Spencer and Marie Beaugureau
Production Editor: Melanie Yarbrough
Copyeditor: Gillian McGarvey
Proofreader: Jasmine Kwityn
Indexer: Wendy Catalano
Interior Designer: David Futato
Cover Designer: Ellie Volkhausen
Illustrator: Rebecca Demarest
July 2015: First Edition
Revision History for the First Edition
2015-06-24: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491900987 for release
details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop
Security, the cover image, and related trade dress are trademarks of O’Reilly
Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that
the information and instructions contained in this work are accurate, the
publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained
in this work is at your own risk. If any code samples or other technology this
work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-90098-7
[LSI]
It has not been very long since the phrase “Hadoop security” was an
oxymoron. Early versions of the big data platform, built and used at web
companies like Yahoo! and Facebook, didn’t try very hard to protect the data
they stored. They didn’t really have to—very little sensitive data went into
Hadoop. Status updates and news stories aren’t attractive targets for bad
guys. You don’t have to work that hard to lock them down.
As the platform has moved into more traditional enterprise use, though, it has
begun to work with more traditional enterprise data. Financial transactions,
personal bank account and tax information, medical records, and similar
kinds of data are exactly what bad guys are after. Because Hadoop is now
used in retail, banking, and healthcare applications, it has attracted the
attention of thieves as well.
And if data is a juicy target, big data may be the biggest and juiciest of all.
Hadoop collects more data from more places, and combines and analyzes it
in more ways than any predecessor system, ever. It creates tremendous value
in doing so.
Clearly, then, “Hadoop security” is a big deal.
This book, written by two of the people who’ve been instrumental in driving
security into the platform, tells the story of Hadoop’s evolution from its early,
wide open consumer Internet days to its current status as a trusted place for
sensitive data. Ben and Joey review the history of Hadoop security, covering
its advances and its evolution alongside new business problems. They cover
topics like identity, encryption, key management and business practices, and
discuss them in a real-world context.
It’s an interesting story. Hadoop today has come a long way from the
software that Facebook chose for image storage a decade ago. It offers much
more power, many more ways to process and analyze data, much more scale,
and much better performance. Therefore it has more pieces that need to be
secured, separately and in combination.
The best thing about this book, though, is that it doesn’t merely describe. It
prescribes. It tells you, very clearly and with the detail that you expect from
seasoned practitioners who have built Hadoop and used it, how to manage
your big data securely. It gives you the very best advice available on how to
analyze, process, and understand data using the state-of-the-art platform—
and how to do so safely.
Mike Olson,
Chief Strategy Officer and Cofounder, Cloudera, Inc.
Apache Hadoop is still a relatively young technology, but that has not limited
its rapid adoption and the explosion of tools that make up the vast ecosystem
around it. This is certainly an exciting time for Hadoop users. While the
opportunity to add value to an organization has never been greater, Hadoop
still provides a lot of challenges to those responsible for securing access to
data and ensuring that systems respect relevant policies and regulations.
There exists a wealth of information available to developers building
solutions with Hadoop and administrators seeking to deploy and operate it.
However, guidance on how to design and implement a secure Hadoop
deployment has been lacking.
This book provides in-depth information about the many security features
available in Hadoop and organizes it using common computer security
concepts. It begins with introductory material in the first chapter, followed by
material organized into four larger parts: Part I, Security Architecture; Part
II, Authentication, Authorization, and Accounting; Part III, Data Security; and
Part IV, Putting It All Together. These parts cover the early stages of
designing a physical and logical security architecture all the way through
implementing common security access controls and protecting data. Finally,
the book wraps up with use cases that gather many of the concepts covered in
the book into real-world examples.
This book targets Hadoop administrators charged with securing their big data
platform and established security architects who need to design and integrate
a Hadoop security plan within a larger enterprise architecture. It presents
many Hadoop security concepts including authentication, authorization,
accounting, encryption, and system architecture.
Chapter 1 includes an overview of some of the security concepts used
throughout this book, as well as a brief description of the Hadoop ecosystem.
If you are new to Hadoop, we encourage you to review Hadoop Operations
and Hadoop: The Definitive Guide as needed. We assume that you are
familiar with Linux, computer networks, and general system architecture. For
administrators who do not have experience with securing distributed
systems, we provide an overview in Chapter 2. Practiced security architects
might want to skip that chapter unless they’re looking for a review. In
general, we don’t assume that you have a programming background, and try
to focus on the architectural and operational aspects of implementing Hadoop
security.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file
extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to
program elements such as variable or function names, databases, data
types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by
values determined by context.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
Throughout this book, we provide examples of configuration files to help
guide you in securing your own Hadoop environment. A downloadable
version of some of those examples is available at
https://github.com/hadoop-security/examples. In Chapter 13, we provide a
complete example of designing, implementing, and deploying a web interface
for saving snapshots of web pages. The complete source code for the
example, along with instructions for securely configuring a Hadoop cluster
for deployment of the application, is available for download at GitHub.
This book is here to help you get your job done. In general, if example code
is offered with this book, you may use it in your programs and
documentation. You do not need to contact us for permission unless you’re
reproducing a significant portion of the code. For example, writing a
program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly
books does require permission. Answering a question by citing this book and
quoting example code does not require permission. Incorporating a
significant amount of example code from this book into your product’s
documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes
the title, author, publisher, and ISBN. For example: “Hadoop Security by
Ben Spivey and Joey Echeverria (O’Reilly). Copyright 2015 Ben Spivey and
Joey Echeverria, 978-1-491-90098-7.”
If you feel your use of code examples falls outside fair use or the permission
given above, feel free to contact us at permissions@oreilly.com.
NOTE
Safari Books Online is an on-demand digital library that delivers expert
content in both book and video form from the world’s leading authors in
technology and business.
Technology professionals, software developers, web designers, and business
and creative professionals use Safari Books Online as their primary resource
for research, problem solving, learning, and certification training.
Safari Books Online offers a range of plans and pricing for enterprise,
government, education, and individuals.
Members have access to thousands of books, training videos, and
prepublication manuscripts in one fully searchable database from publishers
like O’Reilly Media, Prentice Hall Professional, Addison-Wesley
Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press,
Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM
Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders,
McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For
more information about Safari Books Online, please visit us online.
Please address comments and questions concerning this book to the
publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any
additional information. You can access this page at http://bit.ly/hadoop-security.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see
our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Ben and Joey would like to thank the following people who have made this
book possible: our editor, Marie Beaugureau, and all of the O’Reilly Media
staff; Ann Spencer; Eddie Garcia for his guest chapter contribution; our
primary technical reviewers, Patrick Angeles, Brian Burton, Sean Busbey,
Mubashir Kazia, and Alex Moundalexis; Jarek Jarcec Cecho; fellow authors
Eric Sammer, Lars George, and Tom White for their valuable insight; and the
folks at Cloudera for their collective support to us and all other authors.
I would like to dedicate this book to Maria Antonia Fernandez, Jose
Fernandez, and Sarah Echeverria, three people that inspired me every day
and taught me that I could achieve anything I set out to achieve. I also want to
thank my parents, Maria and Fred Echeverria, and my brothers and sisters,
Fred, Marietta, Angeline, and Paul Echeverria, and Victoria Schandevel, for
their love and support throughout this process. I couldn’t have done this
without the incredible support of the Apache Hadoop community. I couldn’t
possibly list everybody that has made an impact, but you need look no further
than Ben’s list for a great start. Lastly, I’d like to thank my coauthor, Ben.
This is quite a thing we’ve done, Bennie (you’re welcome, Paul).
I would like to dedicate this book to the loving memory of Ginny Venable and
Rob Trosinski, two people that I miss dearly. I would like to thank my wife,
Theresa, for her endless support and understanding, and Oliver Morton for
always making me smile. To my parents, Rich and Linda, thank you for
always showing me the value of education and setting the example of
professional excellence. Thanks to Matt, Jess, Noah, and the rest of the
Spivey family; Mary, Jarrod, and Dolly Trosinski; the Swope family; and the
following people that have helped me greatly along the way: Hemal Kanani
(BOOM), Ted Malaska, Eric Driscoll, Paul Beduhn, Kari Neidigh, Jeremy
Beard, Jeff Shmain, Marlo Carrillo, Joe Prosser, Jeff Holoman, Kevin
O’Dell, Jean-Marc Spaggiari, Madhu Ganta, Linden Hillenbrand, Adam
Smieszny, Benjamin Vera-Tudela, Prashant Sharma, Sekou Mckissick,
Melissa Hueman, Adam Taylor, Kaufman Ng, Steve Ross, Prateek Rungta,
Steve Totman, Ryan Blue, Susan Greslik, Todd Grayson, Woody Christy, Vini
Varadharajan, Prasad Mujumdar, Aaron Myers, Phil Langdale, Phil Zeyliger,
Brock Noland, Michael Ridley, Ryan Geno, Brian Schrameck, Michael
Katzenellenbogen, Don Brown, Barry Hurry, Skip Smith, Sarah Stanger,
Jason Hogue, Joe Wilcox, Allen Hsiao, Jason Trost, Greg Bednarski, Ray
Scott, Mike Wilson, Doug Gardner, Peter Guerra, Josh Sullivan, Christine
Mallick, Rick Whitford, Kurt Lorenz, Jason Nowlin, and Chuck
Wigelsworth. Last but not least, thanks to Joey for giving in to my pleading to
help write this book—I never could have done this alone! For those that I
have inadvertently forgotten, please accept my sincere apologies.
I would like to thank my family and friends for their support and
encouragement on my first book-writing experience. Thank you, Sandra,
Kassy, Sammy, Ally, Ben, Joey, Mark, and Peter.
Thank you for reading this book. While the authors of this book have made
every attempt to explain, document, and recommend different security
features in the Hadoop ecosystem, there is no warranty expressed or implied
that using any of these features will result in a fully secured cluster. From a
security point of view, no information system is 100% secure, regardless of
the mechanisms used to protect it. We encourage a constant security review
process for your Hadoop environment to ensure the best possible security
stance. The authors of this book and O’Reilly Media are not responsible for
any damage that might or might not have come as a result of using any of the
features described in this book. Use at your own risk.
Back in 2003, Google published a paper describing a scale-out architecture
for storing massive amounts of data across clusters of servers, which it
called the Google File System (GFS). A year later, Google published another
paper describing a programming model called MapReduce, which took
advantage of GFS to process data in a parallel fashion, bringing the program
to where the data resides. Around the same time, Doug Cutting and others
were building an open source web crawler now called Apache Nutch. The
Nutch developers realized that the MapReduce programming model and GFS
were the perfect building blocks for a distributed web crawler, and they
began implementing their own versions of both projects. These components
would later split from Nutch and form the Apache Hadoop project. The
ecosystem[1] of projects built around Hadoop’s scale-out architecture brought
about a different way of approaching problems by allowing the storage and
processing of all data important to a business.
Before this book can begin covering Hadoop-specific content, it is useful to
understand some key theory and terminology related to information security.
At the heart of information security theory is a model known as CIA, which
stands for confidentiality, integrity, and availability. These three
components of the model are high-level concepts that can be applied to a
wide range of information systems, computing platforms, and—more
specifically to this book—Hadoop. We also take a closer look at
authentication, authorization, and accounting, which are critical
components of secure computing that will be discussed in detail throughout
the book.
WARNING
While the CIA model helps to organize some information security principles, it is important
to point out that this model is not a strict set of standards to follow. Security features in the
Hadoop platform may span more than one of the CIA components, or possibly none at all.
Confidentiality is a security principle focusing on the notion that information
is only seen by the intended recipients. For example, if Alice sends a letter in
the mail to Bob, it would only be deemed confidential if Bob were the only
person able to read it. While this might seem straightforward enough, several
important security concepts are necessary to ensure that confidentiality
actually holds. For instance, how does Alice know that the letter she is
sending is actually being read by the right Bob? If the correct Bob reads the
letter, how does he know that the letter actually came from the right Alice? In
order for both Alice and Bob to take part in this confidential information
passing, they need to have an identity that uniquely distinguishes themselves
from any other person. Additionally, both Alice and Bob need to prove their
identities via a process known as authentication. Identity and authentication
are key components of Hadoop security and are covered at length later in the book.
Another important concept of confidentiality is encryption. Encryption is a
mechanism to apply a mathematical algorithm to a piece of information
where the output is something that unintended recipients are not able to read.
Only the intended recipients are able to decrypt the encrypted message back
to the original unencrypted message. Encryption of data can be applied both
at rest and in flight. At-rest data encryption means that data resides in an
encrypted format when not being accessed. A file that is encrypted and
located on a hard drive is an example of at-rest encryption. In-flight
encryption, also known as over-the-wire encryption, applies to data sent
from one place to another over a network. Both modes of encryption can be
used independently or together. At-rest encryption for Hadoop is covered in
Chapter 9, and in-flight encryption is covered in Chapters 10 and 11.
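To make at-rest encryption concrete, HDFS transparent encryption lets an
administrator create an encryption zone whose files are encrypted on disk. A
minimal sketch, assuming a KMS is already configured; the key name and path
are hypothetical:
[root@hadoop01 ~]# hadoop key create demo-key    # register a key with the KMS (name is hypothetical)
[root@hadoop01 ~]# hadoop fs -mkdir /secure
[root@hadoop01 ~]# hdfs crypto -createZone -keyName demo-key -path /secure
Files subsequently written under /secure are encrypted at rest without any
change to client code; in-flight protection is configured separately.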
Integrity is an important part of information security. In the previous example
where Alice sends a letter to Bob, what happens if Charles intercepts the
letter in transit and makes changes to it unbeknownst to Alice and Bob? How
can Bob ensure that the letter he receives is exactly the message that Alice
sent? This concept is data integrity. The integrity of data is a critical
component of information security, especially in industries with highly
sensitive data. Imagine if a bank did not have a mechanism to prove the
integrity of customer account balances? A hospital’s data integrity of patient
records? A government’s data integrity of intelligence secrets? Even if
confidentiality is guaranteed, data that doesn’t have integrity guarantees is at
risk of substantial damage. Integrity is covered in Chapters 9 and 10.
Availability, the third component of the CIA model, focuses on data and
services being accessible when they are needed. Availability can be affected by
scheduled downtime for upgrades or applying security patches, but it can
also be impacted by security events such as distributed denial-of-service
(DDoS) attacks. The handling of high-availability configurations is covered
in Hadoop Operations and Hadoop: The Definitive Guide, but the concepts
will be covered from a security perspective in Chapters 3 and 10.
Authentication, Authorization, and Accounting
Authentication, authorization, and accounting (often abbreviated, AAA) refer
to an architectural pattern in computer security where users of a service
prove their identity, are granted access based on rules, and where a
recording of a user’s actions is maintained for auditing purposes. Closely
tied to AAA is the concept of identity. Identity refers to how a system
distinguishes between different entities, users, and services, and is typically
represented by an arbitrary string, such as a username or a unique number,
such as a user ID (UID).
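On a single Linux host, the id command shows how a username resolves to a UID
and a set of group memberships. A minimal sketch with illustrative values:
[alice@hadoop01 ~]$ id
uid=1000(alice) gid=1000(alice) groups=1000(alice),10(wheel)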
Before diving into how Hadoop supports identity, authentication,
authorization, and accounting, consider how these concepts are used in the
much simpler case of using the sudo command on a single Linux server. Let’s
take a look at the terminal session for two different users, Alice and Bob. On
this server, Alice is given the username alice and Bob is given the username
bob. Alice logs in first, as shown in Example 1-1.
Example 1-1. Authentication and authorization
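A minimal reconstruction of the session, with illustrative prompts and output:
[client ~]$ ssh alice@hadoop01
alice@hadoop01's password:
[alice@hadoop01 ~]$ sudo /sbin/service sshd status   # succeeds: alice is in wheel
openssh-daemon (pid 3266) is running...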
In Example 1-1, Alice logs in through SSH and she is immediately prompted
for her password. Her username/password pair is used to verify her entry in
the /etc/passwd password file. When this step is completed, Alice has been
authenticated with the identity alice. The next thing Alice does is use the
sudo command to get the status of the sshd service, which requires
superuser privileges. The command succeeds, indicating that Alice was
authorized to perform that command. In the case of sudo, the rules that
govern who is authorized to execute commands as the superuser are stored in
the /etc/sudoers file, shown in Example 1-2.
Example 1-2. /etc/sudoers
[root@hadoop01 ~]# cat /etc/sudoers
root ALL = (ALL) ALL
%wheel ALL = (ALL) NOPASSWD:ALL
[root@hadoop01 ~]#
In Example 1-2, we see that the root user is granted permission to execute
any command with sudo and that members of the wheel group are granted
permission to execute any command with sudo while not being prompted for
a password. In this case, the system is relying on the authentication that was
performed during login rather than issuing a new authentication challenge.
The final question is, how does the system know that Alice is a member of
the wheel group? In Unix and Linux systems, this is typically controlled by
the /etc/group file.
In this way, we can see that two files control Alice’s identity: the
/etc/passwd file (see Example 1-4) assigns her username a unique UID as
well as details such as her home directory, while the /etc/group file (see
Example 1-3) further provides information about the identity of groups on the
system and which users belong to which groups. These sources of identity
information are then used by the sudo command, along with authorization
rules found in the /etc/sudoers file, to verify that Alice is authorized to
execute the requested command.
Example 1-3. /etc/group
[root@hadoop01 ~]# grep wheel /etc/group
wheel:x:10:alice
[root@hadoop01 ~]#
Example 1-4. /etc/passwd
[root@hadoop01 ~]# grep alice /etc/passwd
alice:x:1000:1000:Alice:/home/alice:/bin/bash
[root@hadoop01 ~]#
Now let’s see how Bob’s session turns out in Example 1-5.
Example 1-5. Authorization failure
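A minimal reconstruction of Bob’s session, again with illustrative prompts and
output:
[client ~]$ ssh bob@hadoop01
bob@hadoop01's password:
[bob@hadoop01 ~]$ sudo /sbin/service sshd status   # fails: bob is not in wheel
[sudo] password for bob:
bob is not in the sudoers file.  This incident will be reported.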
In this example, Bob is able to authenticate in much the same way that Alice
does, but when he attempts to use sudo he sees very different behavior. First,
he is again prompted for his password and after successfully supplying it, he
is denied permission to run the command with superuser privileges.
This happens because, unlike Alice, Bob is not a member of the wheel group
and is therefore not authorized to use the sudo command.
That covers identity, authentication, and authorization, but what about
accounting? For actions that interact with secure services such as SSH and
sudo, Linux generates a logfile called /var/log/secure. This file records an
account of certain actions including both successes and failures. If we take a
look at this log after Alice and Bob have performed the preceding actions,
we see the output in Example 1-6 (formatted for readability).
Example 1-6. /var/log/secure
Feb 12 20:33:15 ip-172-25-3-79 sshd[3799]: Accepted password for
bob from 172.18.12.166 port 65017 ssh2
Feb 12 20:33:15 ip-172-25-3-79 sshd[3799]: pam_unix(sshd:session):
session opened for user bob by (uid=0)
Feb 12 20:33:39 ip-172-25-3-79 sudo: bob : user NOT in sudoers;
TTY=pts/2 ; PWD=/home/bob ; USER=root ; COMMAND=/sbin/service sshd status
[root@hadoop01 ~]#
For both users, the fact that they successfully logged in using SSH is
recorded, as are their attempts to use sudo. In Alice’s case, the system
records that she successfully used sudo to execute the /sbin/service sshd
status command as the user root. For Bob, on the other hand, the system
records that he attempted to execute the /sbin/service sshd status
command as the user root and was denied permission because he is not in
/etc/sudoers.
This example shows how the concepts of identity, authentication,
authorization, and accounting are used to maintain a secure system in the
relatively simple example of a single Linux server. These concepts are
covered in detail in a Hadoop context in Part II.
Hadoop Security: A Brief History
Hadoop has its heart in storing and processing large amounts of data
efficiently and as it turns out, cheaply (monetarily) when compared to other
platforms. The focus early on in the project was around the actual technology
to make this happen. Much of the code covered the logic on how to deal with
the complexities inherent in distributed systems, such as handling of failures
and coordination. Due to this focus, the early Hadoop project established a
security stance that the entire cluster of machines and all of the users
accessing it are part of a trusted network. What this effectively means is that
Hadoop did not have strong security measures in place to enforce, well,
much of anything.
As the project evolved, it became apparent that at a minimum there should be
a mechanism for users to strongly authenticate to prove their identities. The
mechanism chosen for the project was Kerberos, a well-established protocol
that today is common in enterprise systems such as Microsoft Active
Directory. After strong authentication came strong authorization. Strong
authorization defined what an individual user could do after they had been
authenticated. Initially, authorization was implemented on a per-component
basis, meaning that administrators needed to define authorization controls in
multiple places. Eventually this became easier with Apache Sentry
(Incubating), but even today there is not a holistic view of authorization
across the ecosystem, as we will see in Chapters 6 and 7.
Hadoop Components and Ecosystem
Readers that are well versed in the components listed can safely skip to the
next section. Unless otherwise noted, security features described throughout
this book apply to the versions of the associated project listed in Table 1-1.
Project | Version |
Apache HDFS | 2.3.0 |
Apache MapReduce (for MR1) | 1.2.1 |
Apache YARN (for MR2) | 2.3.0 |
Apache Hive | 0.12.0 |
Cloudera Impala | 2.0.0 |
Apache HBase | 0.98.0 |
Apache Accumulo | 1.6.0 |
Apache Solr | 4.4.0 |
Apache Oozie | 4.0.0 |
Cloudera Hue | 3.5.0 |
Apache ZooKeeper | 3.4.5 |
Apache Flume | 1.5.0 |
Apache Sqoop | 1.4.4 |
Apache Sentry (Incubating) | 1.4.0-incubating |
An astute reader will notice some omissions in the list of projects covered. In particular, there is
no mention of Apache Spark, Apache Ranger, or Apache Knox. These projects were omitted
due to time constraints and given their status as relatively new additions to the Hadoop
ecosystem.
The Hadoop Distributed File System, or HDFS, is often considered the
foundation component for the rest of the Hadoop ecosystem. HDFS is the
storage layer for Hadoop and provides the ability to store mass amounts of
data while growing storage capacity and aggregate bandwidth in a linear
fashion. HDFS is a logical filesystem that spans many servers, each with
multiple hard drives. This is important to understand from a security
perspective because a given file in HDFS can span many or all servers in the
Hadoop cluster. This means that client interactions with a given file might
require communication with every node in the cluster. This is made possible
by a key implementation feature of HDFS that breaks up files into blocks.
Each block of data for a given file can be stored on any physical drive on any
node in the cluster. Because this is a complex topic that we cannot cover in
depth here, we are omitting the details of how that works and recommend
Hadoop: The Definitive Guide, 3rd Edition by Tom White (O’Reilly). The
important security takeaway is that all files in HDFS are broken up into
blocks, and clients using HDFS will communicate over the network to all of
the servers in the Hadoop cluster when reading and writing files.
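One way to observe this block layout is the fsck tool. A minimal sketch, with
a hypothetical file and abridged, illustrative output:
[alice@hadoop01 ~]$ hdfs fsck /data/events.log -files -blocks -locations
/data/events.log 402653184 bytes, 3 block(s):  OK
0. blk_1073741830 len=134217728 repl=3 [172.25.3.11:50010, 172.25.3.14:50010, 172.25.3.19:50010]
...
Each block reports the DataNodes holding a replica, which is why a single
large read fans out across many machines in the cluster.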
HDFS is built on a head/worker architecture and is comprised of two
primary components: NameNode (head) and DataNode (worker). Additional
components include JournalNode, HttpFS, and NFS Gateway:
NameNode
The NameNode is responsible for keeping track of all the metadata
related to the files in HDFS, such as filenames, block locations, file
permissions, and replication. From a security perspective, it is important
to know that clients of HDFS, such as those reading or writing files,
always communicate with the NameNode. Additionally, the NameNode
provides several important security functions for the entire Hadoop
ecosystem, which are described later.
DataNode
The DataNode is the worker of HDFS. DataNodes store the actual blocks of
file data on their local disks and serve them to clients. From a security
perspective, it is important to know that clients read and write block
data directly to and from DataNodes, not through the NameNode.
JournalNode
The JournalNode is a special type of component for HDFS. When HDFS
is configured for high availability (HA), JournalNodes take over the
NameNode responsibility for writing HDFS metadata information.
Clusters typically have an odd number of JournalNodes (usually three or
five) to ensure majority. For example, if a new file is written to HDFS,
the metadata about the file is written to every JournalNode. When the
majority of the JournalNodes successfully write this information, the
change is considered durable. HDFS clients and DataNodes do not
interact with JournalNodes directly.
HttpFS
HttpFS is a gateway service that exposes HDFS over HTTP using a REST
API, allowing clients to read and write files without direct network
access to every node in the cluster.
NFS Gateway
The NFS Gateway allows HDFS to be mounted as a standard NFS volume, so
that ordinary filesystem clients can browse, read, and write HDFS data.
KMS
The Hadoop Key Management Server, or KMS, plays an important role
in HDFS transparent encryption at rest. Its purpose is to act as the
intermediary between HDFS clients, the NameNode, and a key server,
handling encryption operations such as decrypting data encryption keys
and managing encryption zone keys. This is covered in detail in Chapter 9.
Apache YARN provides cluster resource management for Hadoop. Newer
processing frameworks, such as Impala and Spark, use YARN as the resource
management framework.
While YARN provides a more general resource management framework,
MapReduce is still the canonical application that runs on it. MapReduce that
runs on YARN is considered version 2, or MR2 for short. The YARN
architecture consists of the following components:
ResourceManager
The ResourceManager is the head of YARN. It arbitrates cluster resources
among all running applications, granting containers to applications based
on scheduling policies and access controls.
JobHistory Server
The JobHistory Server keeps the status, counters, and log metadata of
completed MapReduce jobs so that users can review job details after the
fact.
NodeManager
The NodeManager daemon is responsible for launching individual tasks
for jobs within YARN containers, which consist of virtual cores (CPU
resources) and RAM resources. Individual tasks can request some
number of virtual cores and memory depending on its needs. The
minimum, maximum, and increment ranges are defined by the
ResourceManager. Tasks execute as separate processes with their own
JVM. One important role of the NodeManager is to launch a special task
called the ApplicationMaster. This task is responsible for managing the
status of all tasks for the given application. YARN separates resource
management from task management to better scale YARN applications in
large clusters as each job executes its own ApplicationMaster.
It is important to distinguish the YARN-based version of MapReduce from
the standalone MapReduce
framework, which has been retroactively named MR1. MapReduce jobs are
submitted by clients to the MapReduce framework and operate over a subset
of data in HDFS, usually a specified directory. MapReduce itself is a
programming paradigm that allows chunks of data, or blocks in the case of
HDFS, to be processed by multiple servers in parallel, independent of one
another. While a Hadoop developer needs to know the intricacies of how
MapReduce works, a security architect largely does not. What a security
architect needs to know is that clients submit their jobs to the MapReduce
framework and from that point on, the MapReduce framework handles the
distribution and execution of the client code across the cluster. Clients do not
interact with any of the nodes in the cluster to make their job run. Jobs
themselves require some number of tasks to be run to complete the work.
Each task is started on a given node by the MapReduce framework’s
scheduling algorithm.
NOTE
Individual tasks started by the MapReduce framework on a given server are executed as
different users depending on whether Kerberos is enabled. Without Kerberos enabled,
individual tasks are run as the mapred system user. When Kerberos is enabled, the
individual tasks are executed as the user that submitted the MapReduce job. However,
even if Kerberos is enabled, it may not be immediately apparent which user is executing
the underlying MapReduce tasks when another component or tool is submitting the
MapReduce job. See “Impersonation” for a relevant detailed discussion regarding Hive
impersonation.
Similar to HDFS, MapReduce is also a head/worker architecture and is
comprised of two primary components:
JobTracker (head)
The JobTracker handles the scheduling and management of MapReduce jobs,
and it provides
security and operational features such as job queues, scheduling pools,
and access control lists to determine authorization. Lastly, the JobTracker
handles job metrics and other information about the job, which are
communicated to it from the various TaskTrackers throughout the
execution of a given job. The JobTracker includes both resource
management and task management, which were split in MR2 between the
ResourceManager and ApplicationMaster.
TaskTracker (worker)
TaskTrackers execute both map and reduce tasks, and the amount of each
that can be run concurrently is part of the MapReduce configuration. The
importanttakeaway from a security standpoint is that the JobTracker
decides what tasks to be run and on which TaskTrackers. Clients do not
have control over how tasks are assigned, nor do they communicate with
TaskTrackers as part of normal job execution.
A key point about MapReduce is that other Hadoop ecosystem components
are frameworks and libraries on top of MapReduce, meaning that
MapReduce handles the actual processing of data, but these frameworks and
libraries abstract the MapReduce job execution from clients. Hive, Pig, and
Sqoop are examples of components that use MapReduce in this fashion.
TIP
Understanding how MapReduce jobs are submitted is an important part of user auditing in
Hadoop, and is discussed in detail in “Block access tokens”. A user submitting her own
Java MapReduce code is a much different activity from a security point of view than a
user using Sqoop to import data from a RDBMS or executing a SQL query in Hive, even
though all three of these activities use MapReduce.
Metastore database
The metastore database is a relational database that contains all the Hive
metadata, such as information about databases, tables, columns, and data
types. This information is used to apply structure to the underlying data in
HDFS at the time of access, also known as schema on read (see the sketch
after this component list).
Metastore server
The metastore server is a Thrift service that clients use to access the
metadata held in the metastore database, so that clients do not need
direct access to the database itself.
HiveServer2
HiveServer2 is the server process that allows clients to submit Hive
queries over JDBC, ODBC, or Thrift, and it provides the integration
points for authentication and impersonation.
HCatalog
HCatalog is a table and storage management layer that exposes the Hive
metastore tables to other ecosystem tools, such as Pig and MapReduce.
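A minimal sketch of schema on read, using hypothetical database, table, and
path names: the table definition below merely overlays column structure on
files that already sit in HDFS.
[alice@hadoop01 ~]$ beeline -u jdbc:hive2://hadoop01:10000 -e "
  CREATE EXTERNAL TABLE web_logs (ip STRING, ts STRING, url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/web_logs';"
No data is rewritten; the declared structure is applied only when the table
is read.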
For more thorough coverage of Hive, have a look at Programming Hive by
Edward Capriolo, Dean Wampler, and Jason Rutherglen (O’Reilly).
Impala daemon (impalad)
The Impala daemon runs on worker nodes and is responsible both for
accepting client queries and for executing query fragments against local
data.
StateStore
The StateStore tracks the health and membership of all Impala daemons in
the cluster and relays that state to them.
Catalog server
The catalog server distributes metadata changes, such as new tables or
partitions, to all of the Impala daemons.
TIP
For more thorough coverage of all things Impala, check out Getting Started with Impala (O’Reilly).
Sentry is the component that provides fine-grained role-based access
controls (RBAC) to several of the other ecosystem components, such as Hive
and Impala. While individual components may have their own authorization
mechanism, Sentry provides a unified authorization that allows centralized
policy enforcement across components. It is a critical component of Hadoop
security, which is why we have dedicated an entire chapter to the topic
(Chapter 7). Sentry consists of the following components:
Sentry server
The Sentry server manages the authorization metadata and provides the
interface that allows clients to retrieve and modify that metadata
securely.
Policy database
The policy database is the backing store that holds the authorization
policy metadata, such as roles and privileges.
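As a sketch of how Sentry’s RBAC surfaces to users, authorization is
typically managed with SQL statements submitted through a component such as
HiveServer2; the role, group, and database names here are hypothetical:
CREATE ROLE analysts;
GRANT ROLE analysts TO GROUP analyst_grp;
GRANT SELECT ON DATABASE sales TO ROLE analysts;  -- read-only access for the group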
Apache HBase is a distributed key/value store inspired by Google’s paper,
“Bigtable: A Distributed Storage System for Structured Data.” HBase
typically utilizes HDFS as the underlying storage layer for
data, and for the purposes of this book we will assume that is the case.
HBase tables are broken up into regions. These regions are partitioned by
row key, which is the index portion of a given key. Row IDs are sorted, thus
a given region has a range of sorted row keys. Regions are hosted by a
RegionServer, where clients request data by a key. The key is comprised of
several components: the row key, the column family, the column qualifier,
and the timestamp. These components together uniquely identify a value
stored in the table.
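A minimal sketch of those key components in the HBase shell, using
hypothetical table, column family, and qualifier names, with illustrative
output:
hbase> get 'users', 'row1', {COLUMN => 'info:email'}
COLUMN                CELL
 info:email           timestamp=1423772019, value=alice@example.com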
Clients accessing HBase first look up the RegionServers that are responsible
for hosting a particular range of row keys. This lookup is done by scanning
the hbase:meta table. When the right RegionServer is located, the client
will make read/write requests directly to that RegionServer rather than
through the master. The client caches the mapping of regions to
RegionServers to avoid going through the lookup process. The location of the
server hosting the hbase:meta table is looked up in ZooKeeper. HBase
consists of the following components:
Master
The Master is responsible for assigning regions to RegionServers,
handling schema changes, and coordinating cluster-wide operations.
RegionServer
RegionServers host regions and serve all of the client read and write
requests for the row keys in those regions.
REST server
The REST server is a gateway that exposes HBase operations over HTTP for
clients that do not use the native Java API.
Thrift server
The Thrift server is a gateway that exposes HBase to clients in other
languages through the cross-language Thrift framework.
For more information on the architecture of HBase and the use cases it is best
suited for, we recommend HBase: The Definitive Guide by Lars George
(O’Reilly).
Apache Accumulo is a sorted and distributed key/value store designed to be
a robust, scalable, high-performance storage and retrieval system. Like
HBase, Accumulo was originally based on the Google BigTable design, but
was built on top of the Apache Hadoop ecosystem of projects (in particular,
HDFS, ZooKeeper, and Apache Thrift). Accumulo uses roughly the same
data model as HBase. Each Accumulo table is split into one or more tablets
that contain a roughly equal number of records distributed by the record’s
row ID. Each record also has a multipart column key that includes a column
family, column qualifier, and visibility label. The visibility label was one of
Accumulo’s first major departures from the original BigTable design.
Visibility labels added the ability to implement cell-level security (we’ll
discuss them in more detail in Chapter 6). Finally, each record also contains
a timestamp that allows users to store multiple versions of records that
otherwise share the same record key. Collectively, the row ID, column, and
timestamp make up a record’s key, which is associated with a particular
value.
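A minimal sketch of visibility labels in the Accumulo shell, using
hypothetical user, table, and label names:
root@accumulo> setauths -u alice -s admin,audit
root@accumulo> table secrets
root@accumulo secrets> insert row1 cf cq value1 -l admin&audit
A scan by a user whose authorizations do not satisfy a cell’s label simply
never returns that cell.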
The tablets are distributed by splitting up the set of row IDs. The split points
are calculated automatically as data is inserted into a table. Each tablet is
hosted by a single TabletServer that is responsible for serving reads and
writes to data in the given tablet. Each TabletServer can host multiple tablets
from the same tables and/or different tables. This makes the tablet the unit of
distribution in the system.
When clients first access Accumulo, they look up the location of the
TabletServer hosting the accumulo.root table. The accumulo.root table
stores the information for how the accumulo.meta table is split into tablets.
The client will directly communicate with the TabletServer hosting
accumulo.root and then again for TabletServers that are hosting the tablets
of the accumulo.meta table. Because the data in these tables—especially
accumulo.root—changes relatively less frequently than other data, the
client will maintain a cache of tablet locations read from these tables to
avoid bottlenecks in the read/write pipeline. Once the client has the location
of the tablets for the row IDs that it is reading/writing, it will communicate
directly with the required TabletServers. At no point does the client have to
interact with the Master, and this greatly aids scalability. Overall, Accumulo
consists of the following components:
Master
The Master is responsible for assigning tablets to TabletServers,
detecting TabletServer failures, and coordinating recovery.
TabletServer
TabletServers host tablets and handle all client reads and writes for
the row IDs in those tablets.
GarbageCollector
The GarbageCollector identifies files in HDFS that are no longer
referenced by any tablet and deletes them.
Tracer
The Tracer collects distributed trace (timing) information from the rest
of the system to aid in diagnosing performance problems.
Monitor
The Monitor is a web application for monitoring the state of the
Accumulo cluster. It displays key metrics such as record count, cache
hit/miss rates, and table information such as scan rate. The Monitor also
acts as an endpoint for log forwarding so that errors and warnings can be
diagnosed from a single interface.
The Apache Solr project, and specifically SolrCloud, enables the search and
retrieval of documents that are part of a larger collection that has been
sharded across multiple physical servers. Search is one of the canonical use
cases for big data and is one of the most common utilities used by anyone
accessing the Internet. Solr is built on top of the Apache Lucene project,
which actually handles the bulk of the indexing and search capabilities. Solr
expands on these capabilities by providing enterprise search features such as
faceted navigation, caching, hit highlighting, and an administration interface.
Solr has a single component, the server. There can be many Solr servers in a
single deployment, which scale out linearly through the sharding provided by
SolrCloud. SolrCloud also provides replication features to accommodate
failures in a distributed environment.
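A minimal sketch of a search request against a SolrCloud collection, with a
hypothetical collection name and query:
[alice@hadoop01 ~]$ curl "http://solr01:8983/solr/emails/select?q=subject:invoice&wt=json"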
Apache Oozie is a workflow management and orchestration system for
Hadoop. It allows for setting up workflows that contain various actions,
each of which can utilize a different component in the Hadoop ecosystem.
For example, an Oozie workflow could start by executing a Sqoop import to
move data into HDFS, then a Pig script to transform the data, followed by a
Hive script to set up metadata structures. Oozie allows for more complex
workflows, such as forks and joins that allow multiple steps to be executed
in parallel, and other steps that rely on multiple steps to be completed before
continuing. Oozie workflows can run on a repeatable schedule based on
different types of input conditions such as running at a certain time or waiting
until a certain path exists in HDFS.
Oozie consists of just a single server component, and this server is
responsible for handling client workflow submissions, managing the
execution of workflows, and reporting status.
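A minimal sketch of submitting a workflow with the Oozie command-line client;
the server URL and properties file are hypothetical, and the returned job ID
is illustrative:
[alice@hadoop01 ~]$ oozie job -oozie http://oozie01:11000/oozie -config job.properties -run
job: 0000005-150212123456789-oozie-oozi-W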
Apache ZooKeeper is a centralized coordination service that distributed
applications use for configuration, naming, and synchronization.
Additionally, ZooKeeper is heavily used in the Hadoop ecosystem for
synchronizing high availability (HA) services, such as NameNode HA and
ResourceManager HA.
ZooKeeper itself is a distributed system that relies on an odd number of
servers called a ZooKeeper ensemble to reach a quorum, or majority, to
acknowledge a given transaction. ZooKeeper has only one component, the
ZooKeeper server.
Apache Flume is a distributed system for collecting, aggregating, and moving
large volumes of streaming event data; a canonical use case is
ingesting log events into HDFS. The Flume architecture consists of three
main pieces: sources, sinks, and channels.
A Flume source defines how data is to be read from the upstream provider.
This would include things like a syslog server, a JMS queue, or even polling
a Linux directory. A Flume sink defines how data should be written
downstream. Common Flume sinks include an HDFS sink and an HBase sink.
Lastly, a Flume channel defines how data is stored between the source and
sink. The two primary Flume channels are the memory channel and file
channel. The memory channel affords speed at the cost of reliability, and the
file channel provides reliability at the cost of speed.
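A minimal sketch of an agent configuration wiring the three pieces together;
all names and values are hypothetical:
# agent "a1": read syslog events over TCP and land them in HDFS via a file channel
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.channels.c1.type = file
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/logs/%Y-%m-%d
a1.sinks.k1.channel = c1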
Flume consists of a single component, a Flume agent. Agents contain the
code for sources, sinks, and channels. An important part of the Flume
architecture is that Flume agents can be connected to each other, where the
sink of one agent connects to the source of another. A common interface in
this case is using an Avro source and sink. Flume ingestion and security is
covered in Chapter 10 and in Using Flume.
Apache Sqoop is a tool for efficiently transferring bulk data between Hadoop
and relational databases. Sqoop1 is a set of client libraries that are
invoked from the command line
using the sqoop binary. These client libraries are responsible for the actual
submission of the MapReduce job to the proper framework (e.g., traditional
MapReduce or MapReduce2 on YARN). Sqoop is discussed in more detail
in Chapter 10 and in Apache Sqoop Cookbook.
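A minimal sketch of a Sqoop1 import; the JDBC URL, credentials, table, and
target directory are hypothetical:
[alice@hadoop01 ~]$ sqoop import \
    --connect jdbc:mysql://db01/sales \
    --username etl -P \
    --table orders \
    --target-dir /data/orders
Under the covers this launches a MapReduce job that reads from the database
in parallel and writes the rows into HDFS.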
Cloudera Hue is a web application that exposes many of the Hadoop
ecosystem components in a user-friendly way. Hue allows for easy access
into the Hadoop cluster without requiring users to be familiar with Linux or
the various command-line interfaces the components have. Hue has several
different security controls available, which we’ll look at in Chapter 12. Hue
is comprised of the following components:
Hue server
This is the main component of Hue. It is effectively a web server that
serves web content to users. Users are authenticated at first logon and
from there, actions performed by the end user are actually done by Hue
itself on behalf of the user. This concept is known as impersonation (covered in Chapter 5).
Kerberos Ticket Renewer
As the name implies, this component is responsible for periodically
renewing the Kerberos ticket-granting ticket (TGT), which Hue uses to
interact with the Hadoop cluster when the cluster has Kerberos enabled
(Kerberos is discussed at length in Chapter 4).
This chapter introduced some common security terminology that builds the
foundation of the topics covered throughout the rest of the book. A key
takeaway from this chapter is to become comfortable with the fact that
security for Hadoop is not a completely foreign discussion. Tried-and-true
security principles such as CIA and AAA resonate in the Hadoop context and
will be discussed at length in the chapters to come. Lastly, we took a look at
many of the Hadoop ecosystem projects (and their individual components) to
understand their purpose in the stack, and to get a sense of how security will
apply.
In the next chapter, we will dive right into securing distributed systems. You
will find that many of the security threats and mitigations that apply to
Hadoop are generally applicable to distributed systems.
[1] Apache Hadoop itself consists of four subprojects: HDFS, YARN,
MapReduce, and Hadoop Common. However, the Hadoop ecosystem (Hadoop and
the related projects that build on or integrate with Hadoop) is often
shortened to just Hadoop. We attempt to make it clear when we’re
referring to Hadoop the project versus Hadoop the ecosystem.
Part I. Security Architecture
Chapter 2. Securing Distributed
Systems
In Chapter 1, we covered several key principles of secure computing. In this
chapter, we will take a closer look at the interesting challenges that are
present when considering the security of distributed systems. As we will see,
being distributed considerably increases the potential threats to the system,
thus also increasing the complexity of security measures needed to help
mitigate those threats. A real-life example will help illustrate how security
requirements increase when a system becomes more distributed.
Let’s consider a bank as an example. Many years ago, everyday banking for
the average person meant driving down to the local bank, visiting a bank
teller, and conducting transactions in person. The bank’s security measures
would have included checking the person’s identification and account
number, and verifying that the requested action could be performed, such as
ensuring there was enough money in the account to cover a withdrawal.
Over the years, banks became larger. Your local hometown bank probably
became a branch of a larger bank, thus giving you the ability to conduct
banking not just at the bank’s nearby location but also at any of its other
locations. The security measures necessary to protect assets have grown
because there is no longer just a single physical location to protect. Also,
more bank tellers need to be properly trained.
Taking this a step further, banks eventually started making use of ATMs to
allow customers to withdraw money without having to go to a branch
location. As you might imagine, even more security controls are necessary to
protect the bank beyond what was required when banking was a human
interaction. Next, banks became interconnected with other banks, which
allowed customers from one bank to use the ATMs of a different bank. Banks
then needed to establish security controls between themselves to ensure that
no security was lost as a result of this interconnectivity. Lastly, the Internet
movement introduced the ability to do online banking through a website, or
even from mobile devices. This dramatically increased potential threats and
the security controls needed.
As you can see, what started as a straightforward security task to protect a
small bank in your town has become orders of magnitude more difficult the
more distributed and interconnected the bank became over the decades.
While this example might seem obvious, it starts to frame the problem of how
to design a security architecture for a system that can be distributed across
tens, hundreds, or even thousands of machines. It is no small task but it can
be made less intimidating by breaking it down into pieces, starting with
understanding threats.
Unauthorized Access/Masquerade
One of the most common threat categories comes in the form of unauthorized
access. This happens when someone successfully accesses a system when he
should have otherwise been denied access. One common way for this to
happen is from a masquerade attack. Masquerade is the notion that an invalid
user presents himself as a valid user in order to gain access. You might
wonder how the invalid user presented himself as a valid user. The most
likely answer is that the attacker obtained a valid username and associated
password.
Masquerade attacks are especially prominent since the age of the Internet,
and specifically for distributed systems. Attackers have a variety of ways to
obtain valid usernames and passwords, such as trying common words and
phrases as passwords, or knowing words that are related to the valid user
that might be used as a password. For example, attackers looking to obtain
valid login credentials for a social media website, might collect keywords
from a person’s public posts to come up with a password list to try (e.g., if
the attackers were focusing on New York–based users who list “baseball” as
a hobby, they might try the password yankees).
In the case of an invalid user executing a successful masquerade attack, how
would a security administrator know? After all, if an attacker logged in with
a valid user’s credentials, wouldn’t this appear as normal from the
distributed system’s perspective? Not necessarily. Typically, masquerade
attacks can be profiled by looking at audit logs for login attempts. If an
attacker is using a list of possible passwords to try against a user account, the
unsuccessful attempts should show up in audit logfiles. Seeing a high number
of failed login attempts for a user can usually be attributed to an attack. A
valid user might mistype or forget her password, leading to a small number
of failed login attempts, but 20 successive failed login attempts, for example,
would be unusual.
Another common footprint for masquerade attacks is to look at where, from a
network perspective, the login attempts are coming from. Profiling login
attempts by IP addresses can be a good way to discover if a masquerade
attack is attempted. Are the IP addresses shown as the client attempting to log
in consistent with what is expected, such as coming from a known subnet of
company IP addresses, or are they sourced from another country on the other
side of the world? Also, what time of day did the login attempts occur? Did
Alice try to login to the system at 3:00 a.m., or did she log in during normal
business hours?
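Both profiles can be approximated directly from the audit logs. A minimal
sketch that counts failed SSH logins by source IP; the field position assumes
the default sshd log format, and the counts shown are illustrative:
[root@hadoop01 ~]# grep "Failed password" /var/log/secure | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn
     20 203.0.113.50
      2 172.25.3.101
(Lines for invalid usernames carry extra fields, so a production script would
need to be more careful.)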
Another form of unauthorized access comes from an attacker exploiting a
vulnerability in the system, thus gaining entry without needing to present
valid credentials. Vulnerabilities are discussed in “Vulnerabilities”.
Arguably the single most damaging threat category is the insider threat. As
the name implies, the attacker comes from inside the business and is a regular
user. Insider threats can include employees, consultants, and contractors.
What makes the insider threat so scary is that the attacker already has internal
access to the system. The attacker can log in with valid credentials, get
authorized by the system to perform a certain function, and pass any number
of security checks along the way because she is supposed to be granted
access. This can result in a blatant attack on a system, or something much
more subtle like the attacker leaking sensitive data to unauthorized users by
leveraging her own accesses.
Throughout this book, you will find security features that ensure that the right
users are accessing only the data and services they should be. Combating
insider threats requires effective auditing practices (described in Chapter 8).
In addition to the technical tools available to help combat the insider threat,
business policies need to be established to enforce proper auditing, and
procedures that respond to incidents must be outlined. The need for these
policies is true for all of the threat categories described in this chapter,
though best practices for setting such policies are not covered.
Denial of service (DoS) is a situation where a service is unavailable to one
or more clients. The term service in this case is an umbrella that includes
access to data, processing capabilities, and the general usability of the
system in question. How the denial of service happens can come from a
variety of different attack vectors. In the age of the Internet, a common attack
vector is to simply overwhelm the system in question with excessive network
traffic. This is done by using many computers in parallel, thus making the
attack a distributed denial of service (DDoS). When the system is
bombarded with too many requests for it to handle, it starts failing in some
way, from dropping other valid requests to outright failure of the system.
While distributed systems typically benefit from having fault tolerance of
some kind, DoS attacks are still possible. For example, if a distributed
system contains 50 servers, it might be difficult for attackers to disrupt
service to all 50 machines. What if the distributed system is behind just a few
network devices, such as a network firewall and an access switch? Attackers
can use this to their advantage by targeting the gateway into the distributed
system rather than the distributed system itself. This point is important and
will be covered in Chapter 3 when we discuss architecting a network
perimeter around the cluster.
Data is the single most important component of a distributed system. Without
data, a distributed system is nothing more than an idle hum of servers that
rack up the electric and cooling bills in a data center. Because data is so
important, it is also the focus of security attacks. Threats to data are present
in multiple places in a distributed system. First, data must be stored in a
secure fashion to prevent unauthorized viewing, tampering, or deletion. Next,
data must also be protected in transit, because distributed systems are, well,
distributed. The passing of data across a network can be threatened by
something disruptive like a DoS attack, or something more passive such as an
attacker capturing the network traffic unbeknownst to the communicating
parties. In Chapter 1, we discussed the CIA model and its components.
Ultimately, the CIA model is all about mitigating threats to data.
The coverage of threat categories in the previous section probably was not
the first time you have heard about these things. It’s important that in addition
to understanding these threat categories you also assess the risk to your
particular distributed system. For example, while a denial-of-service attack
may be highly likely to occur for systems that are directly connected to the
Internet, systems that have no outside network access, such as those on a
company intranet, have a much lower risk of this actually happening. Notice
that the risk is low and not completely removed, an important distinction.
Assessing the threats to a distributed system involves taking a closer look at
two key components: the users and the environment. Once you understand
these components, assessing risk becomes more manageable.
Once users are classified into groups by business function, you can start to
identify access patterns and tools that these groups of users need in order to
use the distributed system. For example, if the users of the distributed system
are all developers, several assumptions can be made about the need for shell
access to nodes in the system, logfiles to debug jobs, and developer tools. On
the other hand, business intelligence analysts might not need any of those
things and will instead require a suite of analytical tools that interact with the
distributed system on the user’s behalf.
There will also be users with indirect access to the system. These users
won’t need access to data or processing resources of the system. However,
they’ll still interact with it as a part of, for example, support functions such
as system maintenance, health monitoring, and user auditing. These types of
users need to be accounted for in the overall security model.
To assess the risk for our distributed system, we’ll also need to understand
the environment it resides in. Generally, this will mean assessing the
operational environment both in relation to other logical systems and the
physical world. We’ll take a look at the specifics for Hadoop in Chapter 3.
One of the key criteria for assessing the environment, mentioned briefly, is to
look at whether the distributed system is accessible to the Internet. If so, a
whole host of threats are far more likely to be realized, such as DoS attacks,
vulnerability exploits, and viruses. Distributed systems that are indeed
connected to the Internet will require constant monitoring and alerting, as
well as a regular cadence for applying software patches and updating various
security software definitions.
Another criterion for evaluating the environment is to understand where the
servers that comprise the distributed system are physically located. Are they
located in your company data center? Are they in a third-party–managed data
center? Are they in a public cloud infrastructure? Understanding the answer
to these questions will start to frame the problem of providing a security
assessment. For example, if the distributed system is hosted in a public
cloud, a few threats are immediately apparent: the infrastructure is not owned
by your company, so you do not definitively know who has direct access to
the machines. This expands the scope of insider threat to include your hosting
provider. Also, the usage of a public cloud raises the question of how your
users are connecting to the distributed system and how data flows into and
out of it. Again, threats to communications that occur across an open network
to a shared public cloud have a much higher risk of happening than those that
are within your own company data center.
Vulnerabilities are a separate topic, but they are related to the discussion of
threats and risk. Vulnerabilities exist in a variety of different forms in a
distributed system. A common place for vulnerabilities is in the software
itself. All software has vulnerabilities. This might seem like a harsh
statement, but the truth of it is that no piece of software is 100% secure.
So what exactly is a software vulnerability? Put simply, it’s a piece of code
that is susceptible to some kind of error or failure condition that is not
accounted for gracefully. For instance, consider the simple example of a
piece of software with a password screen that allows users to change their
password (we will assume that the intended logic for the software is to
allow passwords up to 16 characters in length). What happens if the input
field for a new password mistakenly has a maximum length of 8 characters,
and thus truncates the chosen password? This could lead to users setting
shorter passwords than they realized, and worse, less complex passwords
that are easier for an attacker to guess.
Certainly, software vulnerabilities are not the only type of vulnerabilities that
distributed systems are susceptible to. Other vulnerabilities include those
related to the network infrastructure that a distributed system relies on. For
example, many years ago there was a vulnerability that allowed an attacker
to send a ping to a network broadcast address, causing every host in the
network range to reply with a ping response. The attacker crafted the ping
request so that the source IP address was set to a computer that was the
intended target of the attack. The result was that the target host of the attack
was overwhelmed with network communication to the point of failure. This
attack was known as the smurf attack. It has been mitigated, but the point is
that until this was fixed by network hardware vendors, this was a
vulnerability that had nothing to do with the software stack of machines on
the network, yet an attacker could use it to disrupt the service of a particular
machine on the network.
This idea of deploying multiple security controls and protection methods is
called defense in depth.
Looking back in history, defense in depth was not regularly followed.
Security typically meant perimeter security, in that security controls existed
only on the outside, or perimeter, of whatever was to be protected. A
canonical example of this is imagining a thick, tall wall surrounding a castle.
The mindset was that as long as the wall stood, the castle was safe. If the
wall was breached, that was bad news for the castle dwellers. Today, things
have gotten better.
Defense-in-depth security now exists in our everyday lives. Take the
example of going to a grocery store. The grocery store has a door with a lock
on it, and is only unlocked during normal business hours. There is also an
alarm system that is triggered if an intruder illegally enters the building after
hours. During regular hours, shoppers are monitored with security cameras
throughout the store. Finally, store employees are trained to watch for patrons
behaving suspiciously.
All of these security measures are in place to protect the grocery store from a
variety of different threats, such as break-ins, shoplifters, and robberies. Had
the grocery store taken the “castle wall” approach, relying only on
strong door locks, most threats would not be addressed. Defense in depth is
important here because any single security measure is not likely to mitigate
all threats to the store. The same is true for distributed systems. There are
many places where individual security measures can be deployed, such as
setting up a network firewall around the perimeter, restrictive permissions on
data, or access controls to the servers. But implementing all of these
measures together helps to lower the chances that an attack will be
successful.
The next chapter focuses on protecting Hadoop in particular, and building a
sound system architecture is the first step.
Chapter 3. System Architecture
In Chapter 2, we took a look at how the security landscape changes when
going from individual isolated systems to a fully distributed network of
systems. It becomes immediately apparent just how daunting a task it is to
secure hundreds if not thousands of servers in a single Hadoop cluster. In this
chapter, we dive into the details of taking on this challenge by breaking the
cluster down into several components that can independently be secured as
part of an overall security strategy. At a high level, the Hadoop cluster can be
divided into two major areas: the network and the hosts. But before we do
this, let’s explore the operating environment in which the Hadoop cluster
resides.
In the early days of Hadoop, a cluster likely meant a hodgepodge of
repurposed machines used to try out the new technology. You might even
have used old desktop-class machines and a couple of extra access switches
to wire them up. Things have changed dramatically over the years. The days
of stacking a few machines in the corner of a room have been replaced by the
notion that Hadoop clusters are first-class citizens in real enterprises. Where
Hadoop clusters physically and logically fit into the enterprise is called the
operating environment.
Numerous factors that contribute to the choice of operating environment for
Hadoop are out of scope of this book. We will focus on the typical operating
environments in use today. As a result of rapid advances in server and
network hardware (thank Moore’s law), Hadoop can live in a few different
environments:
In-house
This Hadoop environment consists of a collection of physical (“bare
metal”) machines that are owned and operated by the business, and live
in data centers under the control of the business.
Managed
This Hadoop environment is a variation of in-house in that it consists of
physical machines, but the business does not own and operate them. They
are rented from a separate business that handles the full provisioning and
maintenance of the servers, and the servers live in their own data centers.
Cloud
This Hadoop environment consists of machines provisioned on demand from a
public cloud provider; the business controls neither the physical hardware
nor the facilities where it lives.
The first option is physical network segmentation. This is achieved by
sectioning off a portion of the network with devices such as routers,
switches, and firewalls. While these devices operate at higher layers of the
OSI model, from a physical-layer point of view the separation is just that all
devices on one network segment are physically plugged into network devices
that are separate from other devices on the larger network.
The second option is logical network segmentation. Logical segmentation
operates at higher layers of the OSI model, most commonly at the network
layer using Internet Protocol (IP) addressing. With logical separation,
devices in the same network segment are grouped together in some way. The
most common way this is achieved is through the use of network subnets. For
example, if a Hadoop cluster has 150 nodes, it may be that these nodes are
logically grouped on the same /24 subnet (e.g., an IP subnet mask of
255.255.255.0), which represents a maximum of 256 IP addresses (254
usable). Organizing hosts logically in this fashion makes them easier to
administer and secure.
The most common method of network segmentation is a hybrid approach that
uses aspects of both physical and logical network segmentation. The most
common way of implementing the hybrid approach is through the use of
virtual local area networks (VLANs). VLANs allow multiple network
subnets to share physical switches. Each VLAN is a distinct broadcast
domain even though all VLANs share a single layer-2 network. Depending on
the capabilities of the network switches or routers, you might have to assign
each physical port to a single VLAN or you may be able to take advantage of
packet tagging to run multiple VLANs over the same port.
As briefly mentioned before, both physical and logical separation can be,
and often are, used together. Physical and logical separation may be present
in the in-house and managed environments where a Hadoop cluster has a
logical subnet defined, and all machines are physically connected to the same
group of dedicated network devices (e.g., top-of-rack switches and
aggregation switches).
With the cloud operating environment, physical network segmentation is often
more difficult. Cloud infrastructure design goals are such that the location of
hardware is less important than the availability of services sized by
operational need. Some cloud environments allow for users to choose
machines to be in the same locality group. While this is certainly better from
a performance point of view, such as in the case of network latencies, it does
not usually help with security. Machines in the same locality group likely
share the same physical network as other machines.
On the surface, it might seem that network firewalls are separate pieces of
hardware in addition to other network hardware such as routers and
switches, but this is not always true. Modern routers and (multilayer)
switches often perform many of the same core functions as standalone
firewalls. Several key points about firewalls that are important in the context
of Hadoop are discussed in this section.
Basic firewall filtering for Hadoop falls into three general categories: data movement to and from the cluster; client access, which includes end users and third-party tools; and administration traffic. Each of
these general categories carries a different perspective on how network
firewalls will be used to ensure a secure network path between the Hadoop
cluster and everything else.
Data movement
The first category, data movement, is how data is ingested into the cluster or
served to downstream systems. A detailed discussion about securing these
flows from a Hadoop ecosystem perspective takes place in Chapter 10. For
now, the focus is on the network channel for these transfers and the type of
data involved to determine the level of firewall inspection required.
Which machines contain the source data? Is the data landing on a server on
the edge of the cluster before being ingested into the cluster? Which machines
are receiving extracted data from the cluster? Answers to these questions
lead to network firewall rules that, at a high level, could do the following (a brief sketch follows the list):
Permit FTP traffic from a limited set of FTP servers to one or more edge
nodes (described later in this chapter)
Permit worker nodes in the cluster to connect to one or more database
servers to send and receive data over specified ports
Permit data flowing from log events generated from a cluster of web
servers to a set of Flume agents over a limited number of ports
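To make the data movement category concrete, here is a minimal iptables sketch expressing the first and third rules on the receiving hosts. The addresses are hypothetical; the Flume agent port comes from Table 3-1 later in this chapter.
# On an edge node: accept FTP control connections from a known FTP server
iptables -A INPUT -p tcp -s 10.1.2.10/32 --dport 21 -j ACCEPT
# On a host running a Flume agent: accept log events from the web server subnet
iptables -A INPUT -p tcp -s 10.1.3.0/24 --dport 41414 -j ACCEPT
The equivalent intent can also be expressed as permit rules on a network firewall sitting between the segments.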
A follow-up decision that needs to be made is determining where source data
is coming from and if additional firewall inspection is needed. For example,
if an upstream data source is coming from an internal business system, the
firewall policies highlighted are sufficient. However, if an upstream data
source comes from an untrusted source, such as data provided on the open
Internet, it is likely that deep packet inspection is required to help protect the
cluster from malicious content.
Client access
The second common category is all about client access. Again, this subject is
covered in detail in Chapter 11, but what is important from a network
firewall point of view is to understand, and thus classify, the methods clients
will be using to interact with the cluster. Some clusters will operate in a fully
“lights out” environment, meaning that there is no end-user activity permitted.
These types of environments typically run continuous ETL jobs and generate
result sets and reports to downstream systems in a fully automated fashion. In
this environment, client access policies exist simply to block everything. The
only policies necessary to keep the cluster up and running and secure are
those of the data movement and administration variety.
A more typical environment is a mixed environment of users, tools, and
applications accessing the cluster. In this case, organization is key. Where are
the third-party tools running? Can they be isolated to a few known machines?
Where are the users accessing the cluster from? Is it possible to require users
to use an edge node? Where are custom applications running? Is the network
firewall between the application and the cluster, or between the application
and end users?
Administration traffic
The last common category is administration traffic. This includes things like
administrative users logging into cluster machines, audit event traffic from
the cluster to an external audit server, and backup traffic from the cluster to
another network. Backups could be large data transfers using DistCp, or even
backing up the Hive metastore database to a location outside the cluster’s
data center. The term administration traffic is not meant to give a sense of
volume but rather to indicate that the traffic is not something that regular
clients to the cluster generate.
Intrusion Detection and Prevention
Intrusion detection systems (IDS) and intrusion prevention systems (IPS)
are often used interchangeably in the discussion of network security.
However, these two systems are fundamentally different in the role they play
in dealing with suspected intrusions. An IDS, as the name implies, detects an
intrusion. It falls under the same category as monitoring and alerting systems.
An IDS is typically connected to a switch listening in promiscuous mode,
meaning that all traffic on the switch flows to the IDS in addition to the
intended destination port(s). When an IDS finds a packet or stream of packets
it suspects as an attack, it generates an alert. An alert might be an event that
gets sent to a separate monitoring system, or even just an email alias that
security administrators subscribe to. Figure 3-1 shows the network diagram
when an IDS is in place; you will notice that the IDS is not in the network
flow between the outside network and the cluster network.
Figure 3-1. Network diagram with an IDS

An IPS, on the other hand, not only detects an intrusion, but actively tries to
prevent or stop the intrusion as it is happening. This is made possible by the
key difference between an IDS and IPS in that an IPS is not listening
promiscuously on the network, but rather sitting between both sides of the
network. Because of this fact, an IPS can actually stop the flow of intrusions
to the other side of the network. A common feature of an IPS is to fail closed.
This means that upon failure of the IPS, such as being overwhelmed by an
extensive DDoS attack to the point where it can no longer scan packets, it
simply stops all packets from flowing through to the other side of the IPS.
While this might seem like a successful DDoS attack, and in some ways it is, failing closed protects all the devices that are behind the IPS. Figure 3-2 shows
the network diagram when an IPS is in place; you will notice that the IPS is
actually in the network flow between the outside network and the cluster
network.
Figure 3-2. Network diagram with an IPS

Now that we have the 50,000-foot view of what these devices do, how does
it help Hadoop? The answer is that it is another piece of the network security
puzzle. Hadoop clusters inherently store massive amounts of data. Both
detection and prevention of intrusion attempts to the cluster are critical to
protecting the large swath of data. So where do these devices live relative to
the rest of the network in which a Hadoop cluster lives? The answer:
possibly several places.
Placing an IDS inside the trusted network can be a valuable tool to warn
administrators against the insider threat.
Hadoop Roles and Separation Strategies
Before deciding how to separate services across machines, it helps to list the service roles found in a typical cluster (it is assumed you have an understanding of what these roles do, but if that's not the case, refer back to Chapter 1 for a quick review):
HDFS
NameNode (Active/Standby/Secondary), DataNode, JournalNode,
FailoverController, HttpFS, NFSGateway
MapReduce
JobTracker (Active/Standby), TaskTracker, FailoverController
YARN
ResourceManager (Active/Standby), NodeManager, JobHistory Server
Hive
Hive Metastore Server, HiveServer2, WebHCatServer
Impala
Catalog Server, StateStore Server, Impalad
Hue
HueServer, Beeswax, KerberosTicketRenewer
Oozie
OozieServer
ZooKeeper
ZooKeeper Server
HBase
Master, RegionServer, ThriftServer, RESTServer
Accumulo
Master, TabletServer, Tracer, GarbageCollector
Solr
SolrServer
Management and monitoring services
Cloudera Manager, Apache Ambari, Ganglia, Nagios, Puppet, Chef, etc.
Looking at this (nonexhaustive) list, you can see that many of the various
ecosystem projects have a master/worker architecture. This lends itself well
to organizing the service roles from a security architecture perspective.
Additionally, some of the service roles are intended to be client-facing.
Overall, the separation strategy is this: identify all of the master services to
be run on master nodes, worker services on worker nodes, and management
services on management nodes. Additionally, identify which components
require client configuration files to be deployed such that users can access
the services. These client configuration files, along with client-facing
services, are placed on edge nodes. The classifications of nodes are
explained in more detail in the following subsections.
Master nodes host the coordinating services of the cluster:
HDFS NameNode, Secondary NameNode (or Standby NameNode),
FailoverController, JournalNode, and KMS
MapReduce JobTracker and FailoverController
YARN ResourceManager and JobHistory Server
Hive Metastore Server
Impala Catalog Server and StateStore Server
Sentry Server
ZooKeeper Server
HBase Master
Accumulo Master, Tracer, and GarbageCollector
Armed with this list of services, the first security question to ask is: Who
needs access to a master node and for what purpose? The simple answer is
administrators, to perform administrative functions (surprise, surprise).
Clients to the cluster, be it actual end users or third-party tools, can access
all of these services remotely using the standard interfaces that are exposed.
For example, a user issuing the command hdfs dfs -ls can do so on any
machine that has the proper client configuration for the HDFS service. The
user does not need to execute this command on the master node that is running
the HDFS NameNode for it to succeed. With that in mind, here are several
important reasons for limiting access to master nodes to administrators:
Resource contention
If regular end users are able to use master nodes to run arbitrary
programs and thus use system resources, this takes away resources that
may otherwise be needed by the master node roles. This can lead to a
degradation of performance.
Security vulnerabilities
Software has inherent vulnerabilities in it, and Hadoop is no different.
Allowing users to have access to the same machines that have master
node roles running can open the door for exploiting unpatched
vulnerabilities in the Hadoop code (maliciously or accidentally).
Restricting access to master nodes lowers the risk of exposing these
security vulnerabilities.
Denial of service
Users can do crazy things. There isn’t really a nicer way to say it. If end
users are sharing the same machines as master node roles, it inevitably
sets the stage for a user to do something (for the sake of argument,
accidentally) that will take down a master process. Going back to the
resource contention argument, what happens if a user launches a runaway
process that fills up the log directory? Will all of the master node roles
handle it gracefully if they are unable to log anymore? Does an
administrator want to find out? Another example would be a similar case with any other shared resource, such as disk space or network bandwidth.
Worker nodes host the roles that store data and execute user workloads:
HDFS DataNode
MapReduce TaskTracker
YARN NodeManager
Impala Daemon
HBase RegionServer
Accumulo TabletServer
SolrServer
On the surface, it might seem like all cluster users need access to these nodes
because these roles handle user requests for data and processing. However,
this is most often not true. Typically, only administrators need remote access
to worker nodes for maintenance tasks. End users can ingest data, submit
jobs, and retrieve records by utilizing the corresponding interfaces and APIs
available. Most of the time, as will be elaborated on a bit later, services
provide a proxy mechanism that allows administrators to channel user
activity to a certain set of nodes different from the actual worker nodes.
These proxies communicate with worker nodes on behalf of the user,
eliminating the need for direct access.
As with master nodes, there are reasons why limiting access to worker nodes
to administrators makes sense:
Resource contention
When regular end users are performing activities on a worker node
outside the expected processes, it can create skew in resource
management. For example, if YARN is configured to use a certain amount
of system resources based on a calculation done by a Hadoop
administrator taking into account the operating system needs and other
software, what about end-user activity? It is often difficult to accurately
profile user activity and account for it, so it is quite likely that heavily
used worker nodes will not perform well or predictably compared to
worker nodes that are not being used.
Worker role skew
If end-user activity loads some worker nodes more heavily than others, the roles on those nodes will fall behind their peers, skewing the behavior of the distributed services they participate in.
Management nodes host the roles used to administer and monitor the cluster:
Configuration management
Monitoring
Alerting
Software repositories
Backend databases
These management nodes often contain the actual software repositories for
the cluster. This is especially the case when the nodes in the Hadoop cluster
do not have Internet access. The most critical role hosted on a management
node is configuration management software. Whether it is Hadoop specific
(e.g., Cloudera Manager, Apache Ambari) or not (e.g., Puppet, Chef), this is
the place where administrators will set up and configure the cluster. The
corollary to configuration management is monitoring and alerting. These
roles are provided by software packages like Ganglia, Nagios, and the
Hadoop-specific management consoles.
Edge nodes host the client-facing services and configurations:
HDFS HttpFS and NFS gateway
Hive HiveServer2 and WebHCatServer
Network proxy/load balancer for Impala
Hue server and Kerberos ticket renewer
Oozie server
HBase Thrift server and REST server
Flume agent
Client configuration files
Edge nodes housing client configurations to facilitate command-line access would be accessible by users. How granular the classification of nodes within the edge node group is will depend on a variety of factors, including cluster size and use cases. Here are some examples of further classifying edge nodes:
Data Gateway
HDFS HttpFS and NFS gateway, HBase Thrift server and REST server,
Flume agent
SQL Gateway
Hive HiveServer2 and WebHCatServer, Impala load-balancing proxy
(e.g., HAProxy)
User Portal
Hue server and Kerberos ticket renewer, Oozie server, client
configuration files
TIP
While the Impala daemon does not have to be collocated with an HDFS DataNode, it is
not recommended to use a standalone Impala daemon as a proxy. A better option is to put
a dedicated load balancer, such as HAProxy, in front of the Impala daemons. This is the
recommended architecture in the case where clients cannot connect directly to an Impala
daemon on a worker node because of a firewall or other restrictions.
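To make the TIP concrete, a minimal HAProxy configuration fragment for balancing Impala shell connections could look like the following. This is a sketch, not a vetted production configuration: the worker addresses are hypothetical, and port 21000 is the impalad shell port from Table 3-1.
listen impala-shell
    bind :21000
    mode tcp
    balance leastconn
    # Hypothetical worker nodes running impalad
    server worker1 10.1.1.11:21000 check
    server worker2 10.1.1.12:21000 check
Clients then point at the proxy host instead of an individual worker, and the firewall only needs to expose the proxy.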
Using the additional edge node classifications shown, it becomes easier to
break down which nodes users are expected to have remote access to, and
which nodes are only accessible remotely through the configured remote
ports. While users need remote access to the user portal nodes to interact
with the cluster from a shell, it is quite reasonable that both the data and SQL
gateways are not accessible in this way. These nodes are accessible only via
remote ports, which facilitates access to both command-line tools executed
on the user portal, as well as additional business intelligence tools that might
reside somewhere else in the network.
This section digs into how individual nodes should be protected at the
operating-system level.
While Hadoop clusters can span thousands of nodes, these nodes can be
classified into groups, as we will see a bit later in this chapter. With that in
mind, it is important to consider limiting remote access to machines by
identifying which machines need to be accessed and why. Armed with this
information, a remote access policy can be made to restrict remote access
(typically SSH) to authorized users. On the surface, it might seem that
authorized users are analogous to users of the Hadoop cluster, but this is
typically not the case. For example, a developer writing Java MapReduce
code or Pig scripts will likely require command-line access to one or more
nodes in the cluster, whereas an analyst writing SQL queries for Hive and
Impala might not need this access at all if they are using Hue or third-party
business intelligence (BI) tools to interact with the cluster.
Remote access controls are a good way to limit which users are able to log
into a given machine in the cluster. This is useful and necessary, but it is only
a small component of protecting a given machine in the cluster. Host
firewalls are an incredibly useful tool to limit the types of traffic going into
and out of a node. In Linux systems, host firewalls are typically implemented
using iptables. Certainly there are other third-party software packages that
perform this function as well (e.g., commercial software), but we will focus
on iptables, as it is largely available by default in most Linux distributions.
In order to leverage iptables, we must first understand and classify the
network traffic in a Hadoop cluster. Table 3-1 shows common ports that are
used by Hadoop ecosystem components. We will use this table to start
building a host firewall policy for iptables.
Table 3-1. Common Hadoop service ports
Component | Service | Port(s) |
Accumulo | Master | 9999 |
GarbageCollector | 50091 | |
Tracer | 12234 | |
ProxyServer | 42424 | |
TabletServer | 9997 | |
Monitor | 4560, 50095 | |
Cloudera Impala | Catalog Server | 25020, 26000 |
StateStore | 24000, 25010 | |
Daemon | 21000, 21050, 22000, 23000, 25000, 28000 | |
Llama ApplicationMaster | 15000, 15001, 15002 | |
Flume | Agent | 41414 |
HBase | Master | 60000, 60010 |
REST Server | 8085, 20550 |
Thrift Server | 9090, 9095 | |
RegionServer | 60020, 60030 | |
HDFS | NameNode | 8020, 8022, 50070, 50470 |
SecondaryNameNode | 50090, 50495 | |
DataNode | 1004, 1006, 50010, 50020, 50075, 50475 | |
JournalNode | 8480, 8485 | |
HttpFS | 14000, 14001 | |
NFS Gateway | 111, 2049, 4242 | |
KMS | 16000, 16001 | |
Hive | Hive Metastore Server | 9083 |
HiveServer2 | 10000 | |
WebHCat Server | 50111 | |
Hue | Server | 8888 |
MapReduce | JobTracker | 8021, 8023, 9290, 50030 |
FailoverController | 8018 | |
TaskTracker | 4867, 50060 | |
Oozie | Server | 11000, 11001, 11443 |
Sentry | Server | 8038, 51000 |
Solr | Server | 8983, 8984 |
YARN | ResourceManager | 8030, 8031, 8032, 8033, 8088, 8090 |
JobHistory Server | 10020, 19888, 19890 | |
NodeManager | 8040, 8041, 8042, 8044 | |
ZooKeeper | Server | 2181, 3181, 4181, 9010 |
Now that we have the common ports listed, we need to understand how strict a policy needs to be enforced. Configuring iptables rules involves both
ports and IP addresses, as well as the direction of communication. A typical
basic firewall policy allows any host to reach the allowed ports, and all
return (established) traffic is allowed. An example iptables policy for an
HDFS NameNode might look like the one in Example 3-1.
Example 3-1. Basic NameNode iptables policy
iptables -N hdfs
iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 8020 -j ACCEPT
iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 8022 -j ACCEPT
iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 50070 -j ACCEPT
iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 50470 -j ACCEPT
iptables -A INPUT -j hdfs
This policy is more relaxed in that it allows all hosts (0.0.0.0/0) to connect
to the machine over the common HDFS NameNode service ports. However,
this might be too open a policy. Let us say that the Hadoop cluster nodes are
all part of the 10.1.1.0/24 subnet. Furthermore, a dedicated edge node is set
up on the host 10.1.1.254 for all communication to the cluster. Finally, SSL is
enabled for web consoles. The adjusted iptables policy for the NameNode
machine might instead look like the one in Example 3-2.
Example 3-2. Secure NameNode iptables policy
iptables -N hdfs
iptables -A hdfs -p tcp -s 10.1.1.254/32 --dport 8020 -j ACCEPT
iptables -A hdfs -p tcp -s 10.1.1.254/32 --dport 8022 -j DROP
iptables -A hdfs -p tcp -s 10.1.1.0/24 --dport 8022 -j ACCEPT
iptables -A hdfs -p tcp -s 0.0.0.0/0 --dport 50470 -j ACCEPT
iptables -A INPUT -j hdfs
The adjusted policy is now a lot more restrictive. It allows any user to get to
the NameNode web console over SSL (port 50470), only cluster machines to
connect to the NameNode over the dedicated DataNode RPC port (8022),
and user traffic to the NameNode RPC port (8020) to occur only from the
edge node.
NOTE
It might be necessary to insert the iptables jump target to a specific line number in the
INPUT section of your policy for it to take effect. An append is shown for simplicity.
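For example, if an earlier rule in the INPUT chain would otherwise reject the traffic, the jump target can be inserted at the top of the chain rather than appended; a brief sketch:
# Append to the end of the INPUT chain (as in the examples above)
iptables -A INPUT -j hdfs
# Or insert as rule number 1 so it is evaluated before existing INPUT rules
iptables -I INPUT 1 -j hdfs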
Another often-discussed feature related to operating system security is
Security Enhanced Linux (SELinux), which was originally developed by the
National Security Agency (NSA), an intelligence organization in the United
States. The premise of SELinux is to provide Linux kernel enhancements that
allow for the definition and enforcement of mandatory access control (MAC) policies. At a high level, SELinux can be configured in a few different ways:
Disabled
In this mode, SELinux is not active and does not provide any additional
level of security to the operating system. This is far and away the most
common configuration for Hadoop.
Permissive
In this mode, SELinux is enabled but does not protect the system. What it
does instead is print warnings when a policy has been violated. This
mode is very useful to profile the types of workloads on a system to
begin building a customized policy.
Enforcing
In this mode, SELinux is enabled and protects the system based upon the
specified SELinux policy in place.
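On Red Hat/CentOS systems, the current mode can be inspected and changed with the standard utilities; a quick sketch:
# Show the current SELinux mode (Enforcing, Permissive, or Disabled)
getenforce
# Switch to permissive mode until the next reboot (0 = permissive, 1 = enforcing)
setenforce 0
# To persist a mode across reboots, set it in /etc/selinux/config:
#   SELINUX=permissive      (enforcing, permissive, or disabled)
#   SELINUXTYPE=targeted    (targeted or mls)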
In addition to the enabled modes of permissive and enforcing, SELinux has two different types of enforcement: targeted enforcement and multilevel security (MLS). With targeted enforcement, only certain processes are
targeted, meaning they have an associated policy that governs the protection.
Processes that do not have a policy are not protected by SELinux. This, of
course, is a less stringent mode of protection. MLS, on the other hand, is
much more in depth. The premise of MLS at a very high level is that all users
and processes carry a security level, while files and other objects carry a
security-level requirement. MLS is modeled after U.S. government
classification levels, such as Top Secret, Secret, Confidential, and
Unclassified. In the U.S. government classification system, these levels
create a hierarchy where each user with a given level of access has
permission to any information at a lower level. For example, if a user has a
security level of Secret, then the user will be permitted to access objects in
the operating system at the Secret, Confidential, and Unclassified security
level because Confidential and Unclassified are both lower levels than
Secret. However, they would not be able to access objects marked at the Top
Secret security level.
All of this sounds great, but what does it have to do with Hadoop? Can
SELinux be used as an additional level of protection to the operating system
that is running the various Hadoop ecosystem components? The short answer:
most likely not. This is not to say that it is not possible—rather, it is an
admission that advancements in security integration with SELinux and the
creation of associated policies that security administrators can deploy in the
cluster are simply absent at this point. What compounds the problem is the
nature of the Hadoop ecosystem. Today it is filled with hundreds of
components, tools, and other widgets that integrate and/or enhance the
platform in one way or another. The more tools that are added in the mix, the
harder it is to come up with a set of SELinux policies to govern them all.
For those that push the limits of adoption, the likely choice is to set up
systems in permissive mode and run what equates to “normal” workloads in
the cluster, leveraging as many of the tools as deemed typical for the given
environment. Once this has been done over a suitable period of time, the
warnings generated by SELinux can be used to start building out a policy.
In this chapter, we analyzed the Hadoop environment with broad strokes, first
identifying the operating environment that it resides in. Then we discussed
protecting this environment from a network security perspective, taking
advantage of common security practices such as network segmentation and
introducing network security devices like firewalls and IDS/IPS. The next
level of granularity was understanding how to break down a Hadoop cluster
into different node groups based upon the types of services they run. Finally,
we provided recommendations for securing the operating systems of
individual nodes based on the node group.
In Chapter 4, we take a look at a fundamental component of Hadoop security
architecture: Kerberos. Kerberos is a key player in enterprise systems, and
Hadoop is no exception. The Kerberos chapter will close out the discussion
on security architecture and set the stage for authentication, authorization, and
accounting.
Kerberos often intimidates even experienced system administrators and
developers at the first mention of it. Applications and systems that rely on
Kerberos often have many support calls and trouble tickets filed to fix
problems related to it. This chapter will introduce the basic Kerberos
concepts that are necessary to understand how strong authentication works,
and explain how it plays an important role with Hadoop authentication in
Chapter 5.
So what exactly is Kerberos? From a mythological point of view, Kerberos
is the Greek word for Cerberus, a multiheaded dog that guards the entrance
to Hades to ensure that nobody who enters will ever leave. Kerberos from a
technical (and more pleasant) point of view is the term given to an
authentication mechanism developed at Massachusetts Institute of Technology
(MIT). Kerberos evolved to become the de facto standard for strong
authentication for computer systems large and small, with varying
implementations ranging from MIT’s Kerberos distribution to the
authentication component of Microsoft’s Active Directory.
To use an analogy, if a person at a party approached you and introduced
himself as “Bill,” you naturally would believe that he is, in fact, Bill. How
do you know that he really is Bill? Well, because he said so and you
believed him without question. Hadoop without Kerberos behaves in much
the same way, except that, to take the analogy a step further, Hadoop not only
believes “Bill” is who he says he is but makes sure that everyone else
believes it, too. This is a problem.
First, identities in Kerberos are called principals. Every user and service
that participates in the Kerberos authentication protocol requires a principal
to uniquely identify itself. Principals are classified into two categories: user principals and service principals. User principal names, or UPNs, represent
regular users. This closely resembles usernames or accounts in the operating
system world. Service principal names, or SPNs, represent services that a
user needs to access, such as a database on a specific server. The
relationship between UPNs and SPNs will become more apparent when we
work through an example later.
The next important Kerberos term is realm. A Kerberos realm is an
authentication administrative domain. All principals are assigned to a
specific Kerberos realm. A realm establishes a boundary, which makes
administration easier.
Now that we have established what principals and realms are, the natural
next step is to understand what stores and controls all of this information.
The answer is a key distribution center (KDC). The KDC is comprised of
three components: the Kerberos database, the authentication service (AS),
and the ticket-granting service (TGS). The Kerberos database stores all the
information about the principals and the realm they belong to, among other
things. Kerberos principals in the database are identified with a naming
convention that looks like the following:
alice@EXAMPLE.COM
A UPN that uniquely identifies the user (also called the short name) alice in the Kerberos realm EXAMPLE.COM. By convention, the realm name is always uppercase.
bob/admin@EXAMPLE.COM
A variation of a regular UPN in that it identifies an administrator bob for the realm EXAMPLE.COM. The slash (/) in a UPN separates the short name and the admin distinction. The admin component convention is regularly used, but it is configurable as we will see later.
hdfs/node1.example.com@EXAMPLE.COM
This principal represents an SPN for the hdfs service, on the host
node1.example.com, in the Kerberos realm EXAMPLE.COM. The slash (/)
in an SPN separates the short name hdfs and the hostname
node1.example.com.
NOTE
The entire principal name is case sensitive! For instance,
hdfs/Node1.Hadoop.com@EXAMPLE.COM is a different principal than the one in the
third example. Typically, it is best practice to use all lowercase for the principal, except for
the realm component, which is uppercase. The caveat here is, of course, that the
underlying hostnames referred to in SPNs are also lowercase, which is also a best practice
for host naming and DNS.
The second component of the KDC, the AS, is responsible for issuing a
ticket-granting ticket (TGT) to a client when they initiate a request to the AS.
The TGT is used to request access to other services.
The third component of the KDC, the TGS, is responsible for validating
TGTs and granting service tickets. Service tickets allow an authenticated
principal to use the service provided by the application server, identified by
the SPN. The process flow of obtaining a TGT, presenting it to the TGS, and
obtaining a service ticket is explained in the next section. For now,
understand that the KDC has two components, the AS and TGS, which handle
requests for authentication and access to services.
NOTE
There is a special principal of the form krbtgt/<REALM>@<REALM> within the Kerberos
database, such as krbtgt/EXAMPLE.COM@EXAMPLE.COM. This principal is used
internally by both the AS and the TGS. The key for this principal is actually used to
encrypt the content of the TGT that is issued to clients, thus ensuring that the TGT issued
by the AS can only be validated by the TGS.
Table 4-1 provides a summary of the Kerberos terms and abbreviations
introduced in this chapter.
Table 4-1. Kerberos term abbreviations
Term | Name | Description |
UPN | User principal name | A principal that identifies a user in a given realm, with the format <shortname>@<REALM> or <shortname>/admin@<REALM> |
SPN | Service principal name | A principal that identifies a service on a specific host in a given realm, with the format <shortname>/<hostname>@<REALM> |
TGT | Ticket-granting ticket | A special ticket type granted to a user after successfully authenticating to the AS |
KDC | Key distribution center | A Kerberos server that contains three components: the Kerberos database, the AS, and the TGS |
AS | Authentication service | A KDC service that issues TGTs |
TGS | Ticket-granting service | A KDC service that validates TGTs and grants service tickets |
What has been presented thus far are a few of the basic Kerberos components
needed to understand authentication at a high level. Kerberos in its own right
is a very in-depth and complex topic that warrants an entire book on the
subject. Thankfully, that has already been done. If you wish to dive far
deeper than what is presented here, take a look at Jason Garman’s excellent
book, Kerberos: The Definitive Guide (O’Reilly).
Kerberos Workflow: A Simple Example
To see how the pieces fit together, consider the following actors:
EXAMPLE.COM
The Kerberos realm
Alice
A user of the system, identified by the UPN alice@EXAMPLE.COM
myservice
A service that will be hosted on server1.example.com, identified by
the SPN myservice/server1.example.com@EXAMPLE.COM
kdc.example.com
The KDC for the Kerberos realm EXAMPLE.COM
In order for Alice to use myservice, she needs to present a valid service
ticket to myservice. The following list of steps shows how she does this
(some details omitted for brevity):
Alice needs to obtain a TGT. To do this, she initiates a request to the
AS at kdc.example.com, identifying herself as the principal
alice@EXAMPLE.COM.
The AS responds by providing a TGT that is encrypted using the key
(password) for the principal alice@EXAMPLE.COM.
Upon receipt of the encrypted message, Alice is prompted to enter the
correct password for the principal alice@EXAMPLE.COM in order to
decrypt the message.
After successfully decrypting the message containing the TGT, Alice
now requests a service ticket from the TGS at kdc.example.com for
the service identified by
myservice/server1.example.com@EXAMPLE.COM, presenting the
TGT along with the request.
The TGS validates the TGT and provides Alice a service ticket,
encrypted with the myservice/server1.example.com@EXAMPLE.COM
principal’s key.
Alice now presents the service ticket to myservice, which can then
decrypt it using the myservice/server1.example.com@EXAMPLE.COM
key and validate the ticket.
The service myservice permits Alice to use the service because she
has been properly authenticated.
This shows how Kerberos works at a high level. Obviously this is a greatly
simplified example and many of the underlying details have not been
presented. See Figure 4-1 for a sequence diagram of this example.
Figure 4-1. Kerberos workflow example

Kerberos also supports trusts, which allow principals in one realm to access services in another realm. For example, suppose that Example is a very large corporation and has
decided to create multiple realms to identify different lines of business,
including HR.EXAMPLE.COM and MARKETING.EXAMPLE.COM. Because users in
both realms might need to access services from both realms, the KDC for
HR.EXAMPLE.COM needs to trust information from the
MARKETING.EXAMPLE.COM realm and vice versa.
On the surface this seems pretty straightforward, except that there are actually two different types of trusts: one-way trust and two-way trust (sometimes called bidirectional trust or full trust). The example we just looked at represents a two-way trust.
What if there is also a DEV.EXAMPLE.COM realm where developers have
principals that need to access the DEV.EXAMPLE.COM and
MARKETING.EXAMPLE.COM realms, but marketing users should not be able to
access the DEV.EXAMPLE.COM realm? This scenario requires a one-way trust.
A one-way trust is very common in Hadoop deployments when a KDC is
installed and configured to contain all the information about the SPNs for the
cluster nodes, but all UPNs for end users exist in a different realm, such as
Active Directory. Oftentimes, Active Directory administrators or corporate
policies prohibit full trusts for a variety of reasons.
So how does a Kerberos trust actually get established? Earlier in the chapter
it was noted that a special principal is used internally by the AS and TGS,
and it is of the form krbtgt/<REALM>@<REALM>. This principal becomes
increasingly important for establishing trusts. With trusts, the principal
instead takes the form of krbtgt/<TRUSTING_REALM>@<TRUSTED_REALM>.
A key concept of this principal is that it exists in both realms. For example,
if the HR.EXAMPLE.COM realm needs to trust the MARKETING.EXAMPLE.COM
realm, the principal krbtgt/HR.EXAMPLE.COM@MARKETING.EXAMPLE.COM
needs to exist in both realms.
WARNING
The password for the krbtgt/<TRUSTING_REALM>@<TRUSTED_REALM> principal and
the encryption types used must be the same in both realms in order for the trust to be
established.
The previous example shows what is required for a one-way trust. In order
to establish a full trust, the principal
krbtgt/MARKETING.EXAMPLE.COM@HR.EXAMPLE.COM also needs to exist in
both realms. To summarize, for the HR.EXAMPLE.COM realm to have a full
trust with the MARKETING.EXAMPLE.COM realm, both realms need the
principals krbtgt/MARKETING.EXAMPLE.COM@HR.EXAMPLE.COM and
krbtgt/HR.EXAMPLE.COM@MARKETING.EXAMPLE.COM.
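As a minimal sketch, the full trust could be established with the kadmin addprinc command (covered later in this chapter), run against the KDC of each realm with identical passwords and encryption types:
kadmin: addprinc krbtgt/HR.EXAMPLE.COM@MARKETING.EXAMPLE.COM
kadmin: addprinc krbtgt/MARKETING.EXAMPLE.COM@HR.EXAMPLE.COM
For a one-way trust, only the first of the two principals is required.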
As mentioned in the beginning of this chapter, Kerberos was first created at
MIT. Over the years, it has undergone several revisions and the current
version is MIT Kerberos V5, or krb5 as it is often called. This section
covers some of the components of the MIT Kerberos distribution to put some
real examples into play with the conceptual examples introduced thus far.
TIP
For the most up-to-date definitive resource on the MIT Kerberos distribution, consult the
excellent documentation at the official project website.
In the earlier example, we glossed over the fact that Alice initiated an
authentication request. In practice, Alice does this by using the kinit tool
(Example 4-1).
Example 4-1. kinit using the default user
[alice@server1 ~]$ kinit
Enter password for alice@EXAMPLE.COM:
[alice@server1 ~]$
This example pairs the current Linux username alice with the default realm to come up with the suggested principal alice@EXAMPLE.COM. The default realm is explained later when we dive into the configuration files. The kinit tool also allows the user to explicitly identify the principal to authenticate as (Example 4-2).
Example 4-2. kinit using a specified user
[alice@server1 ~]$ kinit alice/admin@EXAMPLE.COM
Enter password for alice/admin@EXAMPLE.COM:
[alice@server1 ~]$
Explicitly providing a principal name is often necessary to authenticate as an
administrative user, as the preceding example depicts. Another option for
authentication is by using a keytab file. A keytab file stores the actual
encryption key that can be used in lieu of a password challenge for a given
principal. Keytab files are useful for noninteractive principals, such
as SPNs, which are often associated with long-running processes like
Hadoop daemons. A keytab file does not have to be a 1:1 mapping to a single
principal. Multiple different principal keys can be stored in a single keytab
file. A user can use kinit with a keytab file by specifying the keytab file
location, and the principal name to authenticate as (again, because multiple
principal keys may exist in the keytab file), shown in Example 4-3.
Example 4-3. kinit using a keytab file
[alice@server1 ~]$ kinit -kt alice.keytab alice/admin@EXAMPLE.COM
[alice@server1 ~]$
TIP
The keytab file allows a user to authenticate without knowledge of the password. Because
of this fact, keytabs should be protected with appropriate controls to prevent unauthorized
users from authenticating with it. This is especially important when keytabs are created for
administrative principals!
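A keytab is typically created with the ktadd command (alias xst) in kadmin, a tool covered later in this chapter. A brief sketch; note that by default ktadd generates new random keys for the principal, invalidating any existing password, and the resulting file should be readable only by its owner:
kadmin: xst -k alice.keytab alice/admin@EXAMPLE.COM
[alice@server1 ~]$ chmod 600 alice.keytab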
Another useful utility that is part of the MIT Kerberos distribution is called
klist. This utility allows users to see what, if any, Kerberos credentials
they have in their credentials cache. The credentials cache is the place on
the local filesystem where, upon successful authentication to the AS, TGTs
are stored. By default, this location is usually the file /tmp/krb5cc_<uid>, where <uid> is the numeric user ID on the local system. After a successful
kinit, alice can view her credentials cache with klist, as shown in
Example 4-4.
Example 4-4. Viewing the credentials cache with klist
[alice@server1 ~]$ kinit
Enter password for alice@EXAMPLE.COM:
[alice@server1 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_5000
Default principal: alice@EXAMPLE.COM
Valid starting Expires Service principal
02/13/14 12:00:27 02/14/14 12:00:27 krbtgt/EXAMPLE.COM@EXAMPLE.COM
renew until 02/20/14 12:00:27
[alice@server1 ~]$
If a user tries to look at the credentials cache without having authenticated
first, no credentials will be found (see Example 4-5).
Example 4-5. No credentials cache found
[alice@server1 ~]$ klist
No credentials cache found (ticket cache FILE:/tmp/krb5cc_5000)
[alice@server1 ~]$
Another useful tool in the MIT Kerberos toolbox is kdestroy. As the name
implies, this allows users to destroy credentials in their credentials cache.
This is useful for switching users, or when trying out or debugging new
configurations (see Example 4-6).
Example 4-6. Destroying the credentials cache with kdestroy
[alice@server1 ~]$ kinit
Enter password for alice@EXAMPLE.COM:
[alice@server1 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_5000
Default principal: alice@EXAMPLE.COM
Valid starting Expires Service principal
02/13/14 12:00:27 02/14/14 12:00:27 krbtgt/EXAMPLE.COM@EXAMPLE.COM
renew until 02/20/14 12:00:27
[alice@server1 ~]$ kdestroy
[alice@server1 ~]$ klist
No credentials cache found (ticket cache FILE:/tmp/krb5cc_5000)
[alice@server1 ~]$
So far, all of the MIT Kerberos examples shown “just work.” Hidden away
in these examples is the fact that there is a fair amount of configuration
necessary to make it all work, both on the client and server side. The next
two sections present basic configurations to tie together some of the concepts
that have been presented thus far.
Kerberos server configuration is primarily specified in the kdc.conf file,
which is shown in Example 4-7. This file lives in /var/kerberos/krb5kdc/ on
Red Hat/CentOS systems.
Example 4-7. kdc.conf
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
EXAMPLE.COM = {
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
supported_enctypes = aes256-cts:normal aes128-cts:normal arcfour-hmac-
md5:normal
max_renewable_life = 7d
}
The first section, kdcdefaults, contains configurations that apply to all the
realms listed, unless the specific realm configuration has values for the same
configuration items. The configurations kdc_ports and kdc_tcp_ports
specify the UDP and TCP ports the KDC should listen on, respectively. The
next section, realms, contains all of the realms that the KDC is the server
for. A single KDC can support multiple realms. The realm configuration
items from this example are as follows:
acl_file
This specifies the file location to be used by the admin server for access
controls (more on this later).
dict_file
This specifies the file that contains words that are not allowed to be used
as passwords because they are easily cracked/guessed.
supported_enctypes
This specifies all of the encryption types supported by the KDC. When
interacting with the KDC, clients must support at least one of the
encryption types listed here. Be aware of using weak encryption types,
such as DES, because they are easily exploitable.
max_renewable_life
This specifies the maximum amount of time that a ticket can be
renewable. Clients can request a renewable lifetime up to this length. A
typical value is seven days, denoted by 7d.
NOTE
By default, encryption settings in MIT Kerberos are often set to a variety of encryption
types, including weak choices such as DES. When possible, remove weak encryption
types to ensure the best possible security. Weak encryption types are easily exploitable and
well documented as such. When using AES-256, the Java Cryptography Extension (JCE) unlimited strength policy files need to be installed on all nodes in the cluster. It is
important to note that some countries prohibit the usage of these encryption types. Always
follow the laws governing encryption strength for your country. A more detailed discussion
of encryption is provided in Chapter 9.
The acl_file location (typically the file kadm5.acl) is used to control
which users have privileged access to administer the Kerberos database.
Administration of the Kerberos database is controlled by two different, but
related, components: kadmin.local and kadmin. The first is a utility that
allows the root user of the KDC server to modify the Kerberos database. As
the name implies, it can only be run by the root user on the same machine
where the Kerberos database resides. Administrators wishing to administer
the Kerberos database remotely must use the kadmin server.
The kadmin server is a daemon process that allows remote connections to
administer the Kerberos database. This is where the kadm5.acl file (shown
in Example 4-8) comes into play. The kadmin utility uses Kerberos
authentication, and the kadm5.acl file specifies which UPNs are allowed to
perform privileged functions.
Example 4-8. kadm5.acl
*/admin@EXAMPLE.COM *
cloudera-scm@EXAMPLE.COM * hdfs/*@EXAMPLE.COM
cloudera-scm@EXAMPLE.COM * mapred/*@EXAMPLE.COM
This allows any principal from the EXAMPLE.COM realm with the /admin
distinction to perform any administrative action. While it is certainly
acceptable to change the admin distinction to some other arbitrary name, it is
recommended to follow the convention for simplicity and maintainability.
Administrative users should only use their admin credentials for specific
privileged actions, much in the same way administrators should not use the
root user in Linux for everyday nonadministrative actions.
The example also shows how the ACL can be defined to restrict privileges to
a target principal. It demonstrates that the user cloudera-scm can perform
any action but only on SPNs that start with hdfs and mapred. This type of
syntax is useful to grant access to a third-party tool to create and administer
Hadoop principals, but not grant access to all of the admin functions.
As mentioned earlier, the kadmin tool allows for administration of the
Kerberos database. This tool brings users to a shell-like interface where
various commands can be entered to perform operations against the Kerberos
database (see Examples 4-9 through 4-12).
Example 4-9. Adding a new principal to the Kerberos database
kadmin: addprinc alice@EXAMPLE.COM
WARNING: no policy specified for alice@EXAMPLE.COM; defaulting to no policy
Enter password for principal "alice@EXAMPLE.COM":
Re-enter password for principal "alice@EXAMPLE.COM":
Principal "alice@EXAMPLE.COM" created.
kadmin:
Example 4-10. Displaying the details of a principal in the Kerberos
database
kadmin: getprinc alice@EXAMPLE.COM
Principal: alice@EXAMPLE.COM
Expiration date: [never]
Last password change: Tue Feb 18 20:48:15 EST 2014
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 7 days 00:00:00
Last modified: Tue Feb 18 20:48:15 EST 2014 (root/admin@EXAMPLE.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 2
Key: vno 1, aes256-cts-hmac-sha1-96, no salt
Key: vno 1, aes128-cts-hmac-sha1-96, no salt
MKey: vno 1
Attributes:
Policy: [none]
kadmin:
Example 4-11. Deleting a principal from the Kerberos database
kadmin: delprinc alice@EXAMPLE.COM
Are you sure you want to delete the principal "alice@EXAMPLE.COM"? (yes/no):
yes
Principal "alice@EXAMPLE.COM" deleted.
Make sure that you have removed this principal from all ACLs before reusing.
kadmin:
Example 4-12. Listing all the principals in the Kerberos database
kadmin: listprincs
HTTP/server1.example.com@EXAMPLE.COM
K/M@EXAMPLE.COM
flume/server1.example.com@EXAMPLE.COM
hdfs/server1.example.com@EXAMPLE.COM
hdfs@EXAMPLE.COM
hive/server1.example.com@EXAMPLE.COM
hue/server1.example.com@EXAMPLE.COM
impala/server1.example.com@EXAMPLE.COM
kadmin/admin@EXAMPLE.COM
kadmin/server1.example.com@EXAMPLE.COM
kadmin/changepw@EXAMPLE.COM
krbtgt/EXAMPLE.COM@EXAMPLE.COM
mapred/server1.example.com@EXAMPLE.COM
oozie/server1.example.com@EXAMPLE.COM
yarn/server1.example.com@EXAMPLE.COM
zookeeper/server1.example.com@EXAMPLE.COM
kadmin:
The default Kerberos client configuration file is typically named krb5.conf,
and lives in the /etc/ directory on Unix/Linux systems. This configuration file
is read whenever client applications need to use Kerberos, including the
kinit utility. The krb5.conf configuration file shown in Example 4-13 is minimally configured from the default that comes with Red Hat/CentOS 6.4.
Example 4-13. krb5.conf
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = DEV.EXAMPLE.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
default_tkt_enctypes = aes256-cts aes128-cts
default_tgs_enctypes = aes256-cts aes128-cts
udp_preference_limit = 1
[realms]
EXAMPLE.COM = {
kdc = kdc.example.com
admin_server = kdc.example.com
}
DEV.EXAMPLE.COM = {
kdc = kdc.dev.example.com
admin_server = kdc.dev.example.com
}
[domain_realm]
.example.com = EXAMPLE.COM
example.com = EXAMPLE.COM
.dev.example.com = DEV.EXAMPLE.COM
dev.example.com = DEV.EXAMPLE.COM
In this example, there are several different sections. The first, logging, is
self-explanatory. It defines where logfiles are stored for the various
Kerberos components that generate log events. The second section,
libdefaults, contains general default configuration information. Let’s take
a closer look at the individual configurations in this section:
default_realm
This defines what Kerberos realm should be assumed if no realm is
provided. This is right in line with the earlier kinit example when a
realm was not provided.
dns_lookup_realm
This specifies whether DNS can be used to determine what Kerberos realm to use.
dns_lookup_kdc
This specifies whether DNS can be used to find the location of the KDC.
ticket_lifetime
This specifies how long an issued ticket is valid before it expires (24 hours here).
renew_lifetime
This specifies how long a ticket can be renewed for. Tickets can be renewed by the KDC without having a client reauthenticate. This must be done prior to tickets expiring.
forwardable
This specifies that tickets can be forwardable, which means that if a user
has a TGT already but logs into a different remote system, the KDC can
automatically reissue a new TGT without the client having to
reauthenticate.
default_tkt_enctypes
This specifies the encryption types to use for session keys when making
requests to the AS. Preference from highest to lowest is left to right.
default_tgs_enctypes
This specifies the encryption types to use for session keys when making
requests to the TGS. Preference from highest to lowest is left to right.
udp_preference_limit
This specifies the maximum packet size to use before switching to TCP
instead of UDP. Setting this to 1 forces TCP to always be used.
The next section, realms, lists all the Kerberos realms that the client is
aware of. The kdc and admin_server configurations tell the client which
server is running the KDC and kadmin processes, respectively. These
configurations can specify the port along with the hostname. If no port is
specified, it is assumed to use port 88 for the KDC and 749 for the admin server.
In this example, two realms are shown. This is a common configuration
where a one-way trust exists between two realms, and clients need to know
about both realms. In this example, perhaps the EXAMPLE.COM realm contains
all of the end-user principals and DEV.EXAMPLE.COM contains all of the
Hadoop service principals for a development cluster. Setting up Kerberos in
this fashion allows users of this dev cluster to use their existing credentials
in EXAMPLE.COM to access it.
The last section, domain_realm, maps DNS names to Kerberos realms. The
first entry says all hosts under the example.com domain map to the
EXAMPLE.COM realm, while the second entry says that example.com itself
maps to the EXAMPLE.COM realm. This is similarly the case with
dev.example.com and DEV.EXAMPLE.COM. If no matching entry is found in
this section, the client will try to use the domain portion of the DNS name
(converted to all uppercase) as the realm name.
The important takeaway from this chapter is that Kerberos authentication is a
multistep client/server process to provide strong authentication of both users
and services. We took a look at the MIT Kerberos distribution, which is a
popular implementation choice. While this chapter covered some of the
details of configuring the MIT Kerberos distribution, we strongly encourage
you to refer to the official MIT Kerberos documentation, as it is the most up-
to-date reference for the latest distribution; in addition, it serves as a more
detailed guide about all of the configuration options available to a security
administrator for setting up a Kerberos environment.
Part II. Authentication,
Authorization, and Accounting
Chapter 5. Identity and
Authentication
The first step necessary for any system securing data is to provide each user with a
unique identity and to authenticate a user’s claim of a particular identity. The reason
authentication and identity are so essential is that no authorization scheme can control
access to data if the scheme can’t trust that users are who they claim to be.
In this chapter, we’ll take a detailed look at how authentication and identity are managed
for core Hadoop services. We start by looking at identity and how Hadoop integrates
information from Kerberos KDCs and from LDAP and Active Directory domains to
provide an integrated view of distributed identity. We’ll also look at how Hadoop
represents users internally and the options for mapping external, global identities to
those internal representations. Next, we revisit Kerberos and go into more details of
how Hadoop uses Kerberos for strong authentication. From there, we’ll take a look at
how some core components use username/password–based authentication schemes and
the role of distributed authentication tokens in the overall architecture. We finish the
chapter with a discussion of user impersonation and a deep dive into the configuration of
Hadoop authentication.
In the context of the Hadoop ecosystem, identity is a relatively complex topic. This is
due to the fact that Hadoop goes to great lengths to be loosely coupled from authoritative
identity sources. In Chapter 4, we introduced the Kerberos authentication protocol, a
topic that will figure prominently in the following section, as it’s the default secure
authentication protocol used in Hadoop. While Kerberos provides support for robust
authentication, it provides very little in the way of advanced identity features such as
groups or roles. In particular, Kerberos exposes identity as a simple two-part string (or
in the case of services, three-part string) consisting of a short name and a realm. While
this is useful for giving every user a unique identifier, it is insufficient for the
implementation of a robust authorization protocol.
In addition to users, most computing systems provide groups, which are typically
defined as a collection of users. Because one of the goals of Hadoop is to integrate with
existing enterprise systems, Hadoop took the pragmatic approach of using a pluggable
system to provide the traditional group concept.
Mapping Kerberos Principals to Usernames
Before diving into more details on how Hadoop maps users to groups, we need to discuss how Hadoop translates Kerberos principal names to usernames. Recall from Chapter 4 that Kerberos uses a two-part string (e.g., alice@EXAMPLE.COM) or three-part string (e.g., hdfs/node1.example.com@EXAMPLE.COM) that contains a short name, realm, and an optional instance name or hostname. To simplify working with usernames, Hadoop maps Kerberos principal names to local usernames. Hadoop can use the auth_to_local setting in the krb5.conf file, or Hadoop-specific rules can be configured in the hadoop.security.auth_to_local parameter in the core-site.xml file.
The value of hadoop.security.auth_to_local is set to one or more rules for mapping principal names to local usernames. A rule can either be the value DEFAULT or the string RULE: followed by three parts: the initial principal translation, the acceptance filter, and the substitution command. The special value DEFAULT maps names in Hadoop's local realm to just the first component (e.g., alice@EXAMPLE.COM is mapped to alice by the DEFAULT rule).
The initial principal translation
The initial principal translation consists of a number followed by the substitution string. The number matches the number of components, not including the realm, of the principal. The substitution string defines how the principal will be initially translated. The variable $0 will be substituted with the realm, $1 will be substituted with the first component, and $2 will be substituted with the second component. See Table 5-1 for some example initial principal translations. The format of the initial principal translation is [<number>:<substitution string>] and the output is called the initial local name.
Table 5-1. Example principal translations
Principal translation | Initial local name for alice@EXAMPLE.COM | Initial local name for hdfs/node1.example.com@EXAMPLE.COM |
[1:$1@$0] | alice@EXAMPLE.COM | No match |
[1:$1] | alice | No match |
[2:$1@$0] | No match | hdfs@EXAMPLE.COM |
[2:$1] | No match | hdfs |
The acceptance filter
The acceptance filter is a regular expression, and if the initial local name (i.e., the output
from the first part of the rule) matches the regular expression, then the substitution
command will be run over the string. The initial local name only matches if the entire
string is matched by the regular expression. This is equivalent to having the regular
expression start with a ^ and end with $. See Table 5-2 for some sample acceptance
filters. The format of the acceptance filter is (<regular expression>).
Table 5-2. Example acceptance filters
Acceptance filter | alice.EXAMPLE.COM | hdfs@EXAMPLE.COM |
(.*\.EXAMPLE\.COM) | Match | No match |
(.*@EXAMPLE\.COM) | No match | Match |
(.*EXAMPLE\.COM) | Match | Match |
(EXAMPLE\.COM) | No match | No match |
The substitution command
The substitution command is a sed-style substitution with a regular expression pattern
and a replacement string. Matching groups can be included by surrounding a portion of
the regular expression in parentheses, and referenced in the replacement string by
number (e.g., \1). The group number is determined by the order of the opening
parentheses in the regular expression. See Table 5-3 for some sample substitution
commands. The format of the substitution command is
s/<pattern>/<replacement>/g. The g at the end is optional, and if it is present then
the substitution will be global over the entire string. If the g is omitted, then only the first
substring that matches the pattern will be substituted.
Table 5-3. Example substitution commands
Substitution command | alice.EXAMPLE.COM | hdfs@EXAMPLE.COM |
s/(.*)\.EXAMPLE.COM/\1/ | alice | Not applicable |
s/.EXAMPLE.COM// | alice | hdfs |
s/E/Q/ | alice.QXAMPLE.COM | hdfs@QXAMPLE.COM |
s/E/Q/g | alice.QXAMPLQ.COM | hdfs@QXAMPLQ.COM |
The complete format for a rule is
RULE:[<initial principal translation>](<acceptance filter>)<substitution command>.
Multiple rules are separated by newlines and rules are evaluated in order. Once a
principal fully matches a rule (i.e., the principal matches the number in the initial
principal translation and the initial local name matches the acceptance filter), the
username becomes the output of that rule and no other rules are evaluated. Due to this
order constraint, it's common to list the DEFAULT rule last.
The most common use of the hadoop.security.auth_to_local setting is to configure
how to handle principals from other Kerberos realms. A common scenario is to have one
or more trusted realms. For example, if your Hadoop realm is EXAMPLE.COM but your
corporate realm is CORP.EXAMPLE.COM, then you'd add rules to translate principals in
the corporate realm into local users. See Example 5-1 for a sample configuration that
only accepts users in the EXAMPLE.COM and CORP.EXAMPLE.COM realms, and
maps users to the first component for both realms.
Example 5-1. Example auth_to_local configuration for a trusted realm
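(The original listing did not survive extraction. The following is a minimal sketch of what such a core-site.xml entry might look like, implementing the behavior just described: RULE entries map CORP.EXAMPLE.COM principals to their first component, and DEFAULT handles the local EXAMPLE.COM realm.)
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@CORP\.EXAMPLE\.COM)s/@.*//
    RULE:[2:$1@$0](.*@CORP\.EXAMPLE\.COM)s/@.*//
    DEFAULT
  </value>
</property>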
Hadoop User to Group Mapping
Hadoop exposes a configuration parameter called hadoop.security.group.mapping
to control how users are mapped to groups. The default implementation uses either
native calls or local shell commands to look up user-to-group mappings using the
standard UNIX interfaces. This means that only the groups that are configured on the
server where the mapping is called are visible to Hadoop. In practice, this is not a major
concern because it is important for all of the servers in your Hadoop cluster to have a
consistent view of the users and groups that will be accessing the cluster.
NOTE
In addition to knowing how the user-to-group mapping system works, it is important to know where the
mapping takes place. As described in Chapter 6, it is important for user-to-group mappings to get resolved
consistently and at the point where authorization decisions are made. For Hadoop, that means that the
mappings occur in the NameNode, JobTracker (for MR1), and ResourceManager (for YARN/MR2)
processes. This is a very important detail, as the default user-to-group mapping implementation determines
group membership by using standard UNIX interfaces; for a group to exist from Hadoop’s perspective, it
must exist from the perspective of the servers running the NameNode, JobTracker, and
ResourceManager.
The hadoop.security.group.mapping configuration parameter can be set to any Java
class that implements the
org.apache.hadoop.security.GroupMappingServiceProvider interface. In
addition to the default described earlier, Hadoop ships with a number of useful
implementations of this interface which are summarized here:
JniBasedUnixGroupsMapping
A JNI-based implementation that invokes the getpwnam_r() and getgrouplist()
libc functions to determine group membership.
JniBasedUnixGroupsNetgroupMapping
An extension of the JniBasedUnixGroupsMapping that invokes the
setnetgrent(), getnetgrent(), and endnetgrent() libc functions to determine
members of netgroups. Only netgroups that are used in service-level authorization
access control lists are included in the mappings.
ShellBasedUnixGroupsMapping
A shell-based implementation that uses the id -Gn command.
ShellBasedUnixGroupsNetgroupMapping
An extension of the ShellBasedUnixGroupsMapping that uses the getent
netgroup shell command to determine members of netgroups. Only netgroups that
are used in service-level authorization access control lists are included in the
mappings.
JniBasedUnixGroupsMappingWithFallback
A wrapper around the JniBasedUnixGroupsMapping class that falls back to the
ShellBasedUnixGroupsMapping class if the native libraries cannot be loaded (this
is the default implementation).
JniBasedUnixGroupsNetgroupMappingWithFallback
A wrapper around the JniBasedUnixGroupsNetgroupMapping class that falls back
to the ShellBasedUnixGroupsNetgroupMapping class if the native libraries cannot
be loaded.
LdapGroupsMapping
Connects directly to an LDAP or Active Directory server to determine group
membership.
WARNING
Regardless of the group mapping configured, Hadoop will cache group mappings and only call the group
mapping implementation when entries in the cache expire. By default, the group cache is configured to
expire every 300 seconds (5 minutes). If you want updates to your underlying groups to appear in Hadoop
more frequently, then set the hadoop.security.groups.cache.secs property in core-site.xml to the
number of seconds you want entries cached. This should be set small enough for updates to be reflected
quickly, but not so small as to require unnecessary calls to your LDAP server or other group provider.
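For example, to have group changes picked up within a minute, a core-site.xml entry might look like this sketch (the property name is the standard one; the value is illustrative):
<property>
  <name>hadoop.security.groups.cache.secs</name>
  <value>60</value>
</property>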
Mapping users to groups using LDAP
Most deployments can use the default group mapping provider. However, for
environments where groups are only available directly from an LDAP or Active
Directory server and not on the cluster nodes, Hadoop provides the
LdapGroupsMapping implementation. This method can be configured by setting several
required parameters in the core-site.xml file on the NameNode, JobTracker, and/or ResourceManager:
hadoop.security.group.mapping.ldap.url
The URL of the LDAP server to use for resolving groups. Must start with ldap://
or ldaps:// (if SSL is enabled).
hadoop.security.group.mapping.ldap.bind.user
The distinguished name of the user to bind as when connecting to the LDAP server.
This user needs read access to the directory and need not be an administrator.
hadoop.security.group.mapping.ldap.bind.password
The password of the bind user. It is a best practice to not use this setting, but to put
the password in a separate file and to configure the
hadoop.security.group.mapping.ldap.bind.password.file property to point
to that path.
WARNING
If you're configuring Hadoop to directly use LDAP, you lose the local groups for Hadoop service
accounts such as hdfs and yarn. This can lead to a large number of log messages warning that no
groups can be found for those users.
For this reason, it’s generally better to use the JNI or shell-based mappings and to integrate with
LDAP/Active Directory at the operating system level. The System Security Services Daemon (SSSD)
provides strong integration with a number of identity and authentication systems and handles common
support for caching and offline access.
Using the parameters described earlier, Example 5-2 demonstrates how to configure
the LdapGroupsMapping provider in core-site.xml.
Example 5-2. Example LDAP mapping in core-site.xml
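(The original listing is missing; this minimal sketch uses the required properties described above, with hostnames, distinguished names, and paths as placeholders.)
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ad.example.com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>CN=HadoopBind,CN=Users,DC=example,DC=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.password.file</name>
  <value>/etc/hadoop/conf/ldap-bind.password</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>DC=example,DC=com</value>
</property>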
In addition to the required parameters, there are several optional parameters that can be
set to control how users and groups are mapped.
hadoop.security.group.mapping.ldap.bind.password.file
The path to a file that contains the password of the bind user. This file should only be
readable by the Unix users that run the daemons (typically hdfs, mapred, and yarn).
hadoop.security.group.mapping.ldap.ssl
Set to true to enable the use of SSL when connecting to the LDAP server. If this
setting is enabled, the hadoop.security.group.mapping.ldap.url must start
with ldaps://.
hadoop.security.group.mapping.ldap.ssl.keystore
The path to a Java keystore that contains the client certificate required by the LDAP
server when connecting with SSL enabled. The keystore must be in the Java keystore
(JKS) format.
hadoop.security.group.mapping.ldap.ssl.keystore.password
The password to the hadoop.security.group.mapping.ldap.ssl.keystore
file. It is a best practice to not use this setting, but to put the password in a separate
file and configure the
hadoop.security.group.mapping.ldap.ssl.keystore.password.file
property to point to that path.
hadoop.security.group.mapping.ldap.ssl.keystore.password.file
The path to a file that contains the password to the
hadoop.security.group.mapping.ldap.ssl.keystore file. This file should
only be readable by Unix users that run the daemons (typically hdfs, mapred, and
yarn).
hadoop.security.group.mapping.ldap.base
The search base for searching the LDAP directory. This is a distinguished name and
will typically be configured as specifically as possible while still covering all users
who access the cluster.
hadoop.security.group.mapping.ldap.search.filter.user
A filter to use when searching the directory for LDAP users. The default setting,
(&(objectClass=user)(sAMAccountName={0})), is usually appropriate for
Active Directory installations. For other LDAP servers, this setting must be changed.
For OpenLDAP and compatible servers, the recommended setting is (&
(objectClass=inetOrgPerson)(uid={0})).
hadoop.security.group.mapping.ldap.search.filter.group
A filter to use when searching the directory for LDAP groups. The default setting,
(objectClass=group), is usually appropriate for Active Directory installations.
hadoop.security.group.mapping.ldap.search.attr.member
The attribute of the group object that identifies the users that are members of the
group.
hadoop.security.group.mapping.ldap.search.attr.group.name
The attribute of the group object that identifies the group’s name.
hadoop.security.group.mapping.ldap.directory.search.timeout
The maximum amount of time in milliseconds to wait for search results from the
directory.
One of the most difficult requirements of Hadoop security to understand is that all users
of a cluster must be provisioned on all servers in the cluster. This means they can either
exist in the local /etc/passwd password file or, more commonly, can be provisioned by
having the servers access a network-based directory service, such as OpenLDAP or
Active Directory. In order to understand this requirement, it’s important to remember that
Hadoop is effectively a service that lets you submit and execute arbitrary code across a
cluster of machines. This means that if you don’t trust your users, you need to restrict
their access to any and all services running on those servers, including standard Linux
services such as the local filesystem. Currently, the best way to enforce those
restrictions is to execute individual tasks (the processes that make up a job) on the
cluster using the username and UID of the user who submitted the job. In order to satisfy
that requirement, it is necessary that every server in the cluster uses a consistent user
database.
NOTE
While it is necessary for all users of the cluster to be provisioned on all of the servers in the cluster, it is
not necessary to enable local or remote shell access to all of those users. A best practice is to provision
the users with a default shell of /sbin/nologin and to disable SSH access using the AllowUsers,
DenyUsers, AllowGroups, and DenyGroups settings in the /etc/ssh/sshd_config file.
Authentication
Kerberos is the primary authentication method for Hadoop and most components of the
ecosystem because Hadoop standardized on it early on in its development of security
features. A summary of the authentication methods by service and protocol is shown in
Table 5-4. In this section, we focus on authentication for HDFS, MapReduce, YARN,
HBase, Accumulo, and ZooKeeper. Authentication for Hive, Impala, Hue, Oozie, and
Solr is deferred to Chapters 11 and 12 because those services are commonly accessed
directly by clients.
Table 5-4. Hadoop ecosystem authentication methods
Service | Protocol | Methods |
HDFS | RPC | Kerberos, delegation token |
HDFS | Web UI | SPNEGO (Kerberos), pluggable |
HDFS | REST | SPNEGO (Kerberos), delegation token |
HDFS | REST (HttpFS) | SPNEGO (Kerberos), delegation token |
MapReduce | RPC | Kerberos, delegation token |
MapReduce | Web UI | SPNEGO (Kerberos), pluggable |
YARN | RPC | Kerberos, delegation token |
YARN | Web UI | SPNEGO (Kerberos), pluggable |
HiveServer2 | Thrift | Kerberos, LDAP (username/password)
Hive Metastore | Thrift | Kerberos, LDAP (username/password)
Impala | Thrift | Kerberos, LDAP (username/password) |
HBase | RPC | Kerberos, delegation token |
HBase | Thrift Proxy | None |
HBase | REST Proxy | SPNEGO (Kerberos) |
Accumulo | RPC | Username/password, pluggable |
Accumulo | Thrift Proxy | Username/password, pluggable |
Solr | HTTP | Based on HTTP container |
Oozie | REST | SPNEGO (Kerberos), delegation token
Hue | Web UI | Username/password (database, PAM, LDAP), SAML, OAuth, SPNEGO |
ZooKeeper | RPC | Digest (username/password), IP, SASL (Kerberos), pluggable |
Out of the box, Hadoop supports two authentication mechanisms: simple and kerberos.
The simple mechanism, which is the default, uses the effective UID of the client process
to determine the username, which it passes to Hadoop with no additional credentials. In
this mode, Hadoop servers fully trust their clients. This default is sufficient for
deployments where any user that can gain access to the cluster is fully trusted with
access to all data and administrative functions on said cluster. For proof-of-concept
systems or lab environments, it is often permissible to run in this mode and rely on
firewalls and limiting the set of users that can log on to any system with client-access to
the cluster. However, this is rarely acceptable for a production system or any system
with multiple tenants. Simple authentication is similarly supported by HBase as its
default mechanism.
HDFS, MapReduce, YARN, HBase, Oozie, and ZooKeeper all support Kerberos as an
authentication mechanism for clients, though the implementations differ somewhat by
service and interface. For RPC-based protocols, the Simple Authentication and Security Layer (SASL) framework is used to add authentication to the underlying
protocol. In theory, any SASL mechanism could be supported, but in practice, the only
mechanisms that are supported are GSSAPI (specifically Kerberos V5) and DIGEST-
MD5 (see “Tokens” for details on DIGEST-MD5). Oozie does not have an RPC
protocol and instead provides clients a REST interface. Oozie uses the Simple and
Protected GSSAPI Negotiation Mechanism (SPNEGO), a protocol first implemented by
Microsoft in Internet Explorer 5.0.1 and IIS 5.0 to do Kerberos authentication over
HTTP. SPNEGO is also supported by the web interfaces for HDFS, MapReduce,
YARN, Oozie, and Hue as well as the REST interfaces for HDFS (both WebHDFS and
HttpFS) and HBase. For both SASL and SPNEGO, the authentication follows the
standard Kerberos protocol and only the mechanism for presenting the service ticket
changes.
Let’s see how Alice would authenticate against the HDFS NameNode using Kerberos:
1. Alice requests a service ticket from the TGS at kdc.example.com for the HDFS
service identified by hdfs/namenode.example.com@EXAMPLE.COM, presenting
her TGT with the request.
2. The TGS validates the TGT and provides Alice a service ticket, encrypted with
the hdfs/namenode.example.com@EXAMPLE.COM principal's key.
3. Alice presents the service ticket to the NameNode (over SASL), which can
decrypt it using its copy of the hdfs/namenode.example.com@EXAMPLE.COM key and
validate the ticket.
Username and Password Authentication
ZooKeeper supports authentication by username and password. Rather than using a
database of usernames and passwords, ZooKeeper defers password checking to the
authorization step (see “ZooKeeper ACLs”). When an ACL is attached to a ZNode, it
includes the authentication scheme and a scheme-specific ID. The scheme-specific ID is
verified using the authentication provider for the given scheme. Username and password
authentication is implemented by the digest authentication provider, which generates a
SHA-1 digest of the username and password. Because verification is deferred to the
authorization check, the authentication step always succeeds. Users add their
authentication details by calling the addAuthInfo() method with digest as the
scheme and username:password as the authData, where username and password are
replaced with their appropriate values.
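As a brief illustration (not part of the original text), a minimal Java sketch of a client supplying digest credentials; the connection string and credentials are placeholders:
import org.apache.zookeeper.ZooKeeper;

public class ZkDigestAuthExample {
  public static void main(String[] args) throws Exception {
    // Connect to ZooKeeper (host and session timeout are placeholders).
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, null);
    // Supply digest credentials; they are only verified later, when a
    // ZNode ACL with the digest scheme is actually checked.
    zk.addAuthInfo("digest", "alice:secret".getBytes());
  }
}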
Accumulo also supports username and password–based authentication. Unlike
ZooKeeper, Accumulo uses the more common approach of storing usernames and
passwords and having an explicit login step that verifies if the password is valid.
Accumulo's authentication system is pluggable through different implementations of the
AuthenticationToken interface. The most common implementation is the
PasswordToken class, which can be initialized from a CharSequence or a Java
Properties file. Sample code for connecting to Accumulo using a username and
password is shown in Example 5-3.
Example 5-3. Connecting to Accumulo with a username and password
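(The original code listing is missing; below is a minimal sketch using the classic Accumulo client API, with instance name, ZooKeeper quorum, and credentials as placeholders.)
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class AccumuloLoginExample {
  public static void main(String[] args) throws Exception {
    // Locate the Accumulo instance via its ZooKeeper quorum.
    Instance instance = new ZooKeeperInstance("accumulo", "zk1.example.com:2181");
    // The explicit login step: getConnector() verifies the password
    // and throws AccumuloSecurityException if it is invalid.
    Connector conn = instance.getConnector("alice", new PasswordToken("secret"));
    System.out.println("Authenticated as: " + conn.whoami());
  }
}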
Tokens
When a user runs a MapReduce job, the client first authenticates with Kerberos and
submits the job to the JobTracker.
The JobTracker then breaks the job into tasks that are subsequently launched by each
TaskTracker in the cluster. Each task has to communicate with the NameNode in order to
open the files that make up its input split. In order for the NameNode to enforce
filesystem permissions, each task has to authenticate against the NameNode. If Kerberos
was the only authentication mechanism, a user’s TGT would have to be distributed to
each task. The downside to that approach is it allows the tasks to authenticate against
any Kerberos protected service, which is not desirable. Hadoop solves this problem by
issuing authentication tokens that can be distributed to each task but are limited to a
specific service.
Delegation tokens
Hadoop has multiple types of tokens that are used to allow subsequent authenticated
access without a TGT or Kerberos service ticket. After authenticating against the
NameNode using Kerberos, a client can obtain a delegation token. The delegation token
is a shared secret between the client and the NameNode and can be used for RPC
authentication using the DIGEST-MD5 mechanism.
Figure 5-1 shows two interactions between a client and the NameNode. First, the client
requests a delegation token using the getDelegationToken() RPC call using a
Kerberos service ticket for authentication (1). The NameNode replies with the
delegation token (2). The client invokes the getListing() RPC call to request a
directory listing, but this time it uses the delegation token for authentication (3). After
validating the token, the NameNode responds with the requested DirectoryListing
(4).
Figure 5-1. Retrieving and using a delegation token
The token has both an expiration date and a max issue date. The token will expire after
the expiration date, but can be renewed even if expired up until the max issue date. A
delegation token can be requested by the client after any initial Kerberos authentication
to the NameNode. The token also has a designated token renewer. The token renewer
authenticates using its Kerberos credentials when renewing a token on behalf of a user.
The most common use of delegation tokens is for MapReduce jobs, in which case the
client designates the JobTracker as the renewer. The delegation tokens are keyed by the
NameNode’s URL and stored in the JobTracker’s system directory so they can be passed
to the tasks. This allows the tasks to access HDFS without putting a user’s TGT at risk.
Block access tokens
File permission checks are performed by the NameNode, not the DataNode. By default,
any client can access any block given only its block ID. To solve this, Hadoop
introduced the notion of block access tokens. Block access tokens are generated by the
NameNode and given to a client after the client is authenticated and the NameNode has
performed the necessary authorization check for access to a file/block. The token
includes the ID of the client, the block ID, and the permitted access mode (READ,
WRITE, COPY, REPLACE) and is signed using a shared secret between the NameNode
and DataNode. The shared secret is never shared with the client and when a block
access token expires, the client has to request a new one from the NameNode.
Figure 5-2 shows how a client uses a block access token to read data. The client will
first use Kerberos credentials to request the location of the block from the NameNode
using the getBlockLocations() RPC call (1). The NameNode will respond with a
LocatedBlock object which includes, among other details, a block access token for the
requested block (2). The client will then request data from the DataNode using the
readBlock() method in the data transfer protocol using the block access token for
authentication (3). Finally, the DataNode will respond with the requested data (4).
Figure 5-2. Accessing a block using a block access token
Job tokens
When submitting a MapReduce job, the JobTracker will create a secret key called a job token that is used by the tasks of the job to authenticate against the TaskTrackers. The
JobTracker places the token in the JobTracker’s system directory on HDFS and
distributes it to the TaskTrackers over RPC. The TaskTrackers will place the token in
the job directory on the local disk, which is only accessible to the job’s user. The job
token is used to authenticate RPC communication between the tasks and the
TaskTrackers as well as to generate a hash, which ensures that intermediate outputs sent
over HTTP in the shuffle phase are only accessible to the tasks of the job. Furthermore,
the TaskTracker returning shuffle data calculates a hash that each task can use to verify
that it is talking to a true TaskTracker and not an impostor.
Figure 5-3 is a time sequence diagram showing which authentication methods are used
during job setup. First, the client requests the creation of a new job using Kerberos for
authentication (1). The JobTracker responds with a job ID that’s used to uniquely
identify the job (2). The client then requests a delegation token from the NameNode with
the JobTracker as the renewer (3). The NameNode responds with the delegation token
(4). Delegation tokens will only be issued if the client authenticates with Kerberos.
Finally, the client uses Kerberos to authenticate with the JobTracker sending the
delegation token and other required job details.
Figure 5-3. Authentication during job setup
Things get more interesting once the job starts executing, as Figure 5-4 shows.
Figure 5-4. Authentication during job execution
The JobTracker will generate a job token for the job and then package up and send the
job token, delegation token, and other required information to the TaskTracker (1). The
JobTracker uses Kerberos authentication when talking to the TaskTracker. The
TaskTracker will then place the tokens into a directory only accessible by the user who
submitted the job, and will launch the tasks (2). The Task uses the delegation token to
open a file and request the block location for its input split (3). The NameNode will
respond with the block location including a block access token for the given block (4).
The Task then uses the block access token to read data from the DataNode (5) and the
DataNode responds with the data (6). As the job progresses, the Task will report task
status to the TaskTracker using the job token to authenticate (7). The TaskTracker will
then report status back to the JobTracker using Kerberos authentication so that overall
job status can be aggregated (8).
Impersonation
There are many services in the Hadoop ecosystem that perform actions on behalf of an
end user. In order to maintain security, these services must authenticate their clients and
be trusted to impersonate other users. Oozie, Hive (in HiveServer2), and Hue all
support impersonating end users when accessing HDFS, MapReduce, YARN, or HBase.
Secure impersonation works consistently across these services and is supported by
designating which users are trusted to perform impersonation. When a trusted user needs
to act on behalf of another user, she must authenticate as herself and supply the username
of the user she is acting on behalf of. Trusted users can be limited to only impersonate
specific groups of users, and only when accessing Hadoop from certain hosts to further
constrain their privileges.
Impersonation is also sometimes called proxying. The user that can perform the
impersonation (i.e., the user that can proxy other users) is called the proxy. The
configuration parameters for enabling impersonation are
hadoop.proxyuser.<username>.hosts and hadoop.proxyuser.<username>.groups,
where <username> is the username of the user doing the impersonating. The values are
comma-separated lists of hosts and groups, respectively, or * to mean all hosts/groups.
If you want both Hue and Oozie to have proxy capabilities, but you want to limit the
users that Oozie can proxy to members of the oozie-users group, then you'd use a
configuration similar to that shown in Example 5-4.
Example 5-4. Example configuration for impersonation
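(The original listing is missing; this sketch uses the hadoop.proxyuser properties described above, with oozie-users as an illustrative group name.)
<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>oozie-users</value>
</property>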
Configuration
For production deployments, Hadoop supports the kerberos mechanism for
authentication. When configured for Kerberos authentication, all users and daemons must
provide valid credentials in order to access RPC interfaces. This means that you must
create a Kerberos service principal for every server/daemon pair in the cluster. You’ll
recall that in Chapter 4 we described the concept of a service principal name (SPN),
which consists of three parts: a service name, a hostname, and a realm. In Hadoop, each
daemon that’s part of a certain service uses that service’s name (hdfs for HDFS,
mapred for MapReduce, and yarn for YARN). Additionally, if you want to enable
Kerberos authentication for the various web interfaces, then you also need to provision
principals with the HTTP service name.
NOTE
The service layout in Table 5-5 is meant to serve as an example, but it isn’t the best way to provision a
cluster. For starters, we’re showing our example with both YARN and MR1 services configured. This is
only meant to show the full range of configuration settings needed for both services. In a real deployment,
you would only deploy one or the other. Similarly, you would not need to deploy a SecondaryNameNode if
you’re running two NameNodes with HA as we’re doing here. Again, this is just to make our example
configuration comprehensive.
Table 5-5. Example service layout
Hostname | Daemons
nn1.example.com | NameNode, JournalNode
nn2.example.com | NameNode, JournalNode
snn.example.com | SecondaryNameNode, JournalNode
rm.example.com | ResourceManager
jt.example.com | JobTracker, JobHistoryServer
dn1.example.com | DataNode, TaskTracker, NodeManager
dn2.example.com | DataNode, TaskTracker, NodeManager
dn3.example.com | DataNode, TaskTracker, NodeManager
The first step is to create all of the required SPNs in your Kerberos KDC and to export a
keytab file for each daemon on each server. The list of SPNs required for each host/role
is shown in Table 5-6 along with a recommended name for their respective keytab files.
You need to create different keytab files per server. We recommend using consistent
names per daemon in order to use the same configuration files on all hosts even though
keytab files with the same name on different hosts will contain different keys.
Table 5-6. Required Kerberos principals
Hostname | Daemon | Keytab file | SPN
nn1.example.com | NameNode/JournalNode | hdfs.keytab | hdfs/nn1.example.com@EXAMPLE.COM, HTTP/nn1.example.com@EXAMPLE.COM
nn2.example.com | NameNode/JournalNode | hdfs.keytab | hdfs/nn2.example.com@EXAMPLE.COM, HTTP/nn2.example.com@EXAMPLE.COM
snn.example.com | SecondaryNameNode/JournalNode | hdfs.keytab | hdfs/snn.example.com@EXAMPLE.COM, HTTP/snn.example.com@EXAMPLE.COM
rm.example.com | ResourceManager | yarn.keytab | yarn/rm.example.com@EXAMPLE.COM
jt.example.com | JobTracker | mapred.keytab | mapred/jt.example.com@EXAMPLE.COM, HTTP/jt.example.com@EXAMPLE.COM
jt.example.com | JobHistoryServer | mapred.keytab | mapred/jt.example.com@EXAMPLE.COM
dn1.example.com | DataNode | hdfs.keytab | hdfs/dn1.example.com@EXAMPLE.COM, HTTP/dn1.example.com@EXAMPLE.COM
dn1.example.com | TaskTracker | mapred.keytab | mapred/dn1.example.com@EXAMPLE.COM, HTTP/dn1.example.com@EXAMPLE.COM
dn1.example.com | NodeManager | yarn.keytab | yarn/dn1.example.com@EXAMPLE.COM, HTTP/dn1.example.com@EXAMPLE.COM
dn2.example.com | DataNode | hdfs.keytab | hdfs/dn2.example.com@EXAMPLE.COM, HTTP/dn2.example.com@EXAMPLE.COM
dn2.example.com | TaskTracker | mapred.keytab | mapred/dn2.example.com@EXAMPLE.COM, HTTP/dn2.example.com@EXAMPLE.COM
dn2.example.com | NodeManager | yarn.keytab | yarn/dn2.example.com@EXAMPLE.COM, HTTP/dn2.example.com@EXAMPLE.COM
dn3.example.com | DataNode | hdfs.keytab | hdfs/dn3.example.com@EXAMPLE.COM, HTTP/dn3.example.com@EXAMPLE.COM
dn3.example.com | TaskTracker | mapred.keytab | mapred/dn3.example.com@EXAMPLE.COM, HTTP/dn3.example.com@EXAMPLE.COM
dn3.example.com | NodeManager | yarn.keytab | yarn/dn3.example.com@EXAMPLE.COM, HTTP/dn3.example.com@EXAMPLE.COM
WARNING
Take care when exporting keytab files, as the default is to randomize the Kerberos key each time a
principal is exported. You can export each principal once and then use the ktutil utility to combine the
necessary keys into the keytab file for each daemon.
We recommend placing the appropriate keytab files into your $HADOOP_CONF_DIR
directory (typically /etc/hadoop/conf).
FULL EXAMPLE CONFIGURATION FILES
A complete set of example configuration files is available in the example repository on GitHub that
accompanies this book.
After you’ve created all of the required SPNs and distributed the keytab files, you need
to configure Hadoop to use Kerberos for authentication. Start by setting
hadoop.security.authentication to kerberos in the core-site.xml file, as shown in
Example 5-5.
Example 5-5. Configuring the authentication type to Kerberos
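(The original listing did not survive extraction; a minimal sketch using the standard property name:)
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>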
HDFS
Next, we need to configure each daemon with its Kerberos principals and keytab files.
For the NameNode, we also have to enable block access tokens by setting
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
to . The NameNode’s configuration should be
set in the hdfs-site.xml file, as shown in Example 5-6.
Example 5-6. Configuring the NameNode for Kerberos
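(A sketch using the standard NameNode security properties; paths and the EXAMPLE.COM realm are placeholders, and the _HOST token is expanded by Hadoop to the local hostname.)
<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>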
If you are not enabling high availability for HDFS, then you would next configure the
SecondaryNameNode in the hdfs-site.xml file, as shown in Example 5-7.
Example 5-7. Configuring the SecondaryNameNode for Kerberos
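(The original listing is missing; a sketch using the standard SecondaryNameNode properties, with placeholder paths and realm.)
<property>
  <name>dfs.secondary.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>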
If you are enabling high availability for HDFS, then you need to configure the
JournalNodes with the following settings in the hdfs-site.xml file, as shown in
Example 5-8.
Example 5-8. Configuring the JournalNode for Kerberos
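(A sketch using the standard JournalNode properties; values are illustrative.)
<property>
  <name>dfs.journalnode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.journalnode.kerberos.principal</name>
  <value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.journalnode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>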
Next, we'll configure the DataNodes with the following settings in the hdfs-site.xml
file. In addition to configuring the keytab and principal name, you must configure the
DataNode to use a privileged port for its RPC and HTTP servers. These ports need to
be privileged because the DataNode does not use Hadoop's RPC framework for the data
transfer protocol. By using privileged ports, the DataNode is authenticating that it was
started by root using jsvc, as shown in Example 5-9.
Example 5-9. Configuring the DataNode for Kerberos
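(A sketch using the standard DataNode properties; the ports shown are conventional choices below 1024.)
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<!-- Privileged ports prove the DataNode was started by root via jsvc. -->
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:1004</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:1006</value>
</property>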
WebHDFS is a REST-based protocol for accessing data in HDFS. WebHDFS scales by
serving data over HTTP from the DataNode that stores the blocks being read. In order to
secure access to WebHDFS, you need to set the following parameters in the hdfs-site.xml file of the NameNodes and DataNodes, as shown in Example 5-10.
Example 5-10. Configuring WebHDFS for Kerberos
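(A sketch using the standard WebHDFS security properties; values are illustrative.)
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.web.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.web.authentication.kerberos.keytab</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>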
The configuration of HDFS is now complete!
YARN
Now we’ll configure YARN, starting with the ResourceManager. You’ll need to set the
configuration parameters in the yarn-site.xml file, as shown in Example 5-11.
Example 5-11. Configuring the ResourceManager for Kerberos
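(A sketch using the standard ResourceManager properties; values are illustrative.)
<property>
  <name>yarn.resourcemanager.keytab</name>
  <value>/etc/hadoop/conf/yarn.keytab</value>
</property>
<property>
  <name>yarn.resourcemanager.principal</name>
  <value>yarn/_HOST@EXAMPLE.COM</value>
</property>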
We configure the NodeManagers to use Kerberos by setting the configuration parameters
in the yarn-site.xml file, as shown in Example 5-12.
Example 5-12. Configuring the NodeManager for Kerberos
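(A sketch using the standard NodeManager properties; values are illustrative.)
<property>
  <name>yarn.nodemanager.keytab</name>
  <value>/etc/hadoop/conf/yarn.keytab</value>
</property>
<property>
  <name>yarn.nodemanager.principal</name>
  <value>yarn/_HOST@EXAMPLE.COM</value>
</property>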
In addition to configuring the NodeManager to use Kerberos for authentication, we need
to configure the NodeManager to use the LinuxContainerExecutor. The
LinuxContainerExecutor uses a setuid binary to launch YARN containers. This
allows the NodeManagers to run the containers using the UID of the user that submitted
the job. This is required in a secure configuration to ensure that Alice can't access files
created by a container launched by Bob. Without the LinuxContainerExecutor, all of
the containers would run as the yarn user and containers could access each other's local
files. First, set the configuration parameters in the yarn-site.xml file, as shown in
Example 5-13.
Example 5-13. Configuring the NodeManager with the LinuxContainerExecutor
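(A sketch using the standard executor properties; the group value is the typical choice.)
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>yarn</value>
</property>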
We also have to configure the executor binary itself. That's done by setting the
configuration parameters in the container-executor.cfg file, as shown in Example 5-14.
The value for the yarn.nodemanager.linux-container-executor.group parameter
should be set to the same value in the yarn-site.xml file and the container-executor.cfg
file. Typically this is set to yarn.
Example 5-14. Configuring the LinuxContainerExecutor
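(A sketch of a container-executor.cfg; the keys are the standard ones and the user lists are illustrative.)
yarn.nodemanager.linux-container-executor.group=yarn
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000
allowed.system.users=hive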
The min.user.id setting is used to prevent the LinuxContainerExecutor from
running containers with UIDs below that value. This is typically set to 1000 or 500
depending on where regular user account UIDs start in your environment. In addition to
this setting, you can set a list of explicitly allowed users and a list of explicitly banned
users. The allowed.system.users setting is used to allow, among other things, the hive user to run containers.
This is needed when enabling Apache Sentry because Hive impersonation is turned off
when Sentry is enabled.
The final step for configuring YARN to use Kerberos is to configure the
JobHistoryServer. This can be done by setting the configuration parameters in the
mapred-site.xml file, as shown in Example 5-15.
Example 5-15. Configuring the JobHistoryServer for Kerberos
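(A sketch using the standard JobHistoryServer properties; values are illustrative.)
<property>
  <name>mapreduce.jobhistory.keytab</name>
  <value>/etc/hadoop/conf/mapred.keytab</value>
</property>
<property>
  <name>mapreduce.jobhistory.principal</name>
  <value>mapred/_HOST@EXAMPLE.COM</value>
</property>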
MapReduce (MR1)
If you're still using MR1, you will skip the preceding steps for YARN and configure the
JobTracker and TaskTrackers. First, set the configuration parameters in the mapred-site.xml file, as shown in Example 5-16.
Example 5-16. Configuring the JobTracker for Kerberos
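(A sketch using the MR1 JobTracker security properties; values are illustrative.)
<property>
  <name>mapreduce.jobtracker.kerberos.principal</name>
  <value>mapred/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>mapreduce.jobtracker.keytab.file</name>
  <value>/etc/hadoop/conf/mapred.keytab</value>
</property>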
Configuring the TaskTrackers is also straightforward. Set the configuration parameters in
the mapred-site.xml file, as shown in Example 5-17.
Example 5-17. Configuring the TaskTracker for Kerberos
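(A sketch using the MR1 TaskTracker security properties; values are illustrative.)
<property>
  <name>mapreduce.tasktracker.kerberos.principal</name>
  <value>mapred/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>mapreduce.tasktracker.keytab.file</name>
  <value>/etc/hadoop/conf/mapred.keytab</value>
</property>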
When we configured the NodeManagers in Examples 5-12 and 5-13, we also had to
enable the LinuxContainerExecutor. The LinuxTaskController is the equivalent in
MR1. Start by setting the configuration parameters in the mapred-site.xml file, as shown
in Example 5-18.
Example 5-18. Configuring the TaskTracker with the LinuxTaskController
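(A sketch using the MR1 task-controller properties; the group value is the typical choice.)
<property>
  <name>mapred.task.tracker.task-controller</name>
  <value>org.apache.hadoop.mapred.LinuxTaskController</value>
</property>
<property>
  <name>mapreduce.tasktracker.group</name>
  <value>mapred</value>
</property>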
We also have to configure the task controller itself. Set the configuration parameters in
the taskcontroller.cfg file. Make sure that the values for
mapreduce.tasktracker.group match the value, typically mapred, used in the
mapred-site.xml file. Unlike the LinuxContainerExecutor, the
LinuxTaskController doesn't let you configure a list of allowed system users. That
means that you might have to lower the min.user.id and increase the number of users
explicitly banned in the banned.users list if you need to allow certain system users to
run jobs, as shown in Example 5-19.
Example 5-19. Configuring the LinuxTaskController
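(A sketch of a taskcontroller.cfg; keys are the standard ones and the user list and paths are illustrative.)
hadoop.log.dir=/var/log/hadoop
mapreduce.tasktracker.group=mapred
min.user.id=500
banned.users=hdfs,bin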
Oozie
As already discussed, Oozie supports Kerberos for authentication. Before enabling
authentication in Oozie, you first must configure Oozie to authenticate itself when
accessing Hadoop. This is done by configuring the following parameters (a sample of
the appropriate configuration parameters is shown in Example 5-20):
oozie.service.HadoopAccessorService.kerberos.enabled
Set to true when Hadoop has hadoop.security.authentication set to kerberos.
local.realm
Set this to the default realm of the Hadoop cluster. This should be the same realm as
the default_realm setting in the krb5.conf file.
oozie.service.HadoopAccessorService.kerberos.principal
The Kerberos principal that Oozie will use to authenticate. This is typically
oozie/<fqdn>@<REALM> where <fqdn> is the fully qualified domain name of the
server running Oozie and <REALM> is the local Kerberos realm.
oozie.service.HadoopAccessorService.keytab.file
The path to the keytab file that has the key for the configured Kerberos principal.
Example 5-20. Configuring Oozie to work with a Kerberos-enabled Hadoop cluster
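(A sketch using the properties just described; hostnames, realm, and paths are placeholders.)
<property>
  <name>oozie.service.HadoopAccessorService.kerberos.enabled</name>
  <value>true</value>
</property>
<property>
  <name>local.realm</name>
  <value>EXAMPLE.COM</value>
</property>
<property>
  <name>oozie.service.HadoopAccessorService.kerberos.principal</name>
  <value>oozie/oozie.example.com@EXAMPLE.COM</value>
</property>
<property>
  <name>oozie.service.HadoopAccessorService.keytab.file</name>
  <value>/etc/oozie/conf/oozie.keytab</value>
</property>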
After Oozie is configured to work with your Kerberos-enabled Hadoop cluster, you’re
ready to configure Oozie to use Kerberos for user authentication. The relevant settings
are as follows (an example configuration is shown in Example 5-21):
oozie.authentication.type
Set the type of authentication required by users. This can be set to simple (the
default), kerberos, or the fully qualified class name of a class that implements the
Hadoop AuthenticationHandler interface.
oozie.authentication.token.validity
The amount of time, in seconds, that authentication tokens are valid. Authentication
tokens are returned as a cookie following the initial authentication method (typically
Kerberos/SPNEGO).
oozie.authentication.signature.secret
A secret used to sign the authentication tokens. If left blank, a random secret will be
generated on startup. If Oozie is configured in HA mode, then this must be the same
secret on all Oozie servers.
oozie.authentication.cookie.domain
The domain name used when generating the authentication cookie. This should be set
to the domain name of the cluster.
oozie.authentication.kerberos.principal
The Kerberos principal used for the Oozie service. Because Oozie uses SPNEGO
over HTTP for authentication, this must be set to HTTP/<fqdn>@<REALM> where
<fqdn> is the fully qualified domain name of the Oozie server and <REALM> is the
local Kerberos realm.
oozie.authentication.kerberos.keytab
The path to the keytab file that has the key for the Kerberos principal.
oozie.authentication.kerberos.name.rules
Rules for translating from Kerberos principals to local usernames. This parameter
uses the same format as the hadoop.security.auth_to_local parameter in
Hadoop. See "Mapping Kerberos Principals to Usernames" and Example 5-21 for
how to configure.
Example 5-21. Configuring Oozie with Kerberos authentication
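(A sketch using the properties just described; the hostname, realm, and keytab path are placeholders.)
<property>
  <name>oozie.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>oozie.authentication.kerberos.principal</name>
  <value>HTTP/oozie.example.com@EXAMPLE.COM</value>
</property>
<property>
  <name>oozie.authentication.kerberos.keytab</name>
  <value>/etc/oozie/conf/oozie.keytab</value>
</property>
<property>
  <name>oozie.authentication.kerberos.name.rules</name>
  <value>DEFAULT</value>
</property>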
If you're running Oozie in high-availability mode, then you need some additional
configuration. First, you should configure Oozie to use ZooKeeper ACLs by setting
oozie.zookeeper.secure to true in the oozie-site.xml file, as shown in Example 5-22.
Example 5-22. Configuring ZooKeeper ACLs for Oozie in oozie-site.xml
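(A minimal sketch, assuming the oozie.zookeeper.secure property named above.)
<property>
  <name>oozie.zookeeper.secure</name>
  <value>true</value>
</property>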
If you're using Oozie with a version of Hadoop prior to Hadoop 2.5.0, then you need to
use the fully qualified domain name of the load balancer in the HTTP principal name. For
example, if you have Oozie servers running on oozie1.example.com and
oozie2.example.com and the load balancer runs on oozie.example.com, then you'd
use a principal of HTTP/oozie.example.com@EXAMPLE.COM on all of the Oozie
servers. In this mode, only access through the load balancer will work. Also, certain
Oozie features such as log streaming won't work. In this setup, you'd set the following in
your oozie-site.xml file, as shown in Example 5-23.
Example 5-23. Configuring the Oozie SPN in a load balancer environment
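(A sketch using the load balancer hostname from the example above.)
<property>
  <name>oozie.authentication.kerberos.principal</name>
  <value>HTTP/oozie.example.com@EXAMPLE.COM</value>
</property>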
Starting with Hadoop 2.5.0 and later, you can include multiple Kerberos principals in
Oozie's keytab file. In this case, you'll include the principal for the load balancer and
the principal for the specific server in the keytab file (e.g.,
HTTP/oozie.example.com@EXAMPLE.COM and
HTTP/oozie1.example.com@EXAMPLE.COM). You then have to set
oozie.authentication.kerberos.principal to *, as shown in Example 5-24.
Example 5-24. Configuring Oozie with multiple SPNs
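(A sketch: the * value tells the authentication filter to use all HTTP principals found in the keytab.)
<property>
  <name>oozie.authentication.kerberos.principal</name>
  <value>*</value>
</property>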
HBase
Configuring HBase with Kerberos authentication is very similar to configuring core
Hadoop. In the interest of space, we refer you to the “Securing Apache HBase” section
of The Apache HBase Reference Guide.
Summary
In this chapter, we introduced the concept of identity and showed how Hadoop leverages
Kerberos principal names to map to usernames. We also saw that Hadoop retrieves
group membership information about a user. This will become important in the next
chapter, which covers authorization.
We also analyzed the different ways that authentication takes place in the cluster. While
Kerberos is the canonical example and used frequently, we saw that there are other ways
that authentication happens with the usage of delegation tokens. This is a key piece of the
Hadoop authentication architecture because it reduces the number of Kerberos
authentication paths that are necessary to complete a workflow—such as an Oozie
workflow that executes a Hive query, which translates to a MapReduce job that
ultimately processes files. Without delegation tokens, each of these steps would require
Kerberos service tickets, adding strain on the Kerberos KDC.
Finally, we introduced the idea of impersonation. We discussed how system users can
authenticate on behalf of other users. This is a frequently used concept because end users
often use tools that sit between them and the services they are attempting to access. With
impersonation, a system or service can then authenticate with a second remote service
and be granted access privileges as if the end user authenticated directly.
Authorization
In “Authentication”, we saw how the various Hadoop ecosystem projects
support strong authentication to ensure that users are who they claim to be.
However, authentication is only part of the overall security story—you also
need a way to model which actions or data an authenticated user can access.
The protection of resources in this manner is called authorization and is
probably one of the most complex topics related to Hadoop security. Each
service is relatively unique in the services it provides, and thus the
authorization model it supports. The sections in this chapter are divided into
subsections based on how each service implements authorization.
We start by looking at HDFS and its support for POSIX-style file permissions,
as well as its support for service-level authorization to restrict user access to
specific HDFS functions. Next, we turn our attention to MapReduce and
YARN, which support a similar style of service-level authorization as well as
a queue-based model controlling access to system resources. In the case of
MapReduce and YARN, authorization is useful for both security and resource
management/multitenancy (for more information on resource management, we
recommend Hadoop Operations by Eric Sammer [O’Reilly]). Finally, we
cover the authorization features of the popular BigTable clones, Apache HBase
and Apache Accumulo, including a discussion of the pros and cons of role-
based and attribute-based security as well as a discussion of cell-level versus
column-level security.
Every attempt to access a file or directory in HDFS must first pass an
authorization check. HDFS adopts the authorization scheme common to POSIX-
compatible filesystems. Permissions are managed by three distinct classes of
user: owner, group, and others. Each file or directory is owned by a specific
user and that user makes up the object’s owner class. Objects are also assigned
a group and all of the members of that group make up the object’s group class.
All users that are not the owner and do not belong to the group assigned to the
object make up the others class. Read, write, and execute permissions can be
granted to each class independently.
These permissions are represented by a single octal integer that is calculated
by summing the permission values (4 for read, 2 for write, and 1 for execute).
For example, to represent that a class has read and execute permissions for a
directory, an octal value of 5 (4+1) would be assigned. In HDFS, it is not
meaningful, nor is it invalid, to assign the execute permission to a file. For
directories, the execute bit gives permission to access a file’s contents and
metadata information if the name of the file is known. In order to list the names
of files in a directory, you need read permission for the directory.
Regardless of the permissions on a file or a directory, the user that the
NameNode runs as (typically hdfs) and any member of the group defined in
dfs.permissions.superusergroup (defaults to supergroup), can read,
write, or delete any file and directory. As far as HDFS is concerned, they are
the equivalent of root on a Linux system.
The permissions assigned to the owner, group, and others can be represented
by concatenating the three octal values in that order. For example, take a file
for which the owner has read and write permissions and all other users have
only read permission. This file’s permissions would be represented as 644; 6
is assigned to the owner because she has both read and write (4+2), and 4 is
assigned to the group and other classes because they only have read
permissions. For a file for which all permissions have been granted to all
users, the permissions would be 777.
In addition to the standard permissions, HDFS supports three additional
special permissions: setuid, setgid, and sticky. These permissions are also
represented as an octal value with 4 for setuid, 2 for setgid, and 1 for sticky.
These permissions are optional and are included to the left of the regular
permission bits if they are specified. Because files in HDFS can’t be executed,
setuid has no effect. Setgid similarly has no effect on files, but for directories it
forces the group of newly created immediate child files and directories to that
of the parent. This is the default behavior in HDFS, so it is not necessary to
enable setgid on directories. The final permission is often called the sticky bit
and it means that files in a directory can only be deleted by the owner of that
file. Without the sticky bit set, a file can be deleted by anyone that has write
access to the directory. In HDFS, the owner of a directory and the HDFS
superuser can also delete files regardless of whether the sticky bit is set. The
sticky bit is useful for directories, such as /tmp, where you want all users to
have write access to the directory but only the owner of the data should be able
to delete data.
With the release of Hadoop 2.4, HDFS is now equipped with extended ACLs.
These ACLs work very much the same way as extended ACLs in a Unix
environment. This allows files and directories in HDFS to have more
permissions than the basic POSIX permissions.
To use HDFS extended ACLs, they must first be enabled on the NameNode. To
do this, set the configuration property dfs.namenode.acls.enabled to true
in hdfs-site.xml. Example 6-1 shows how HDFS extended ACLs are used.
Example 6-1. HDFS extended ACLs example
[alice@hadoop01 ~]$ hdfs dfs -ls /data
Found 1 items
drwxr-xr-x - alice analysts 0 2014-10-25 19:03 /data/alice
[alice@hadoop01 ~]$ hdfs dfs -getfacl /data/alice
# file: /data/alice
# owner: alice
# group: analysts
user::rwx
group::r-x
other::r-x
[alice@hadoop01 ~]$ hdfs dfs -setfacl -m user:bob:r-x /data/alice
[alice@hadoop01 ~]$ hdfs dfs -setfacl -m group:developers:rwx /data/alice
[alice@hadoop01 ~]$ hdfs dfs -ls /data
Found 1 items
drwxr-xr-x+ - alice analysts 0 2014-10-25 19:03 /data/alice
[alice@hadoop01 ~]$ hdfs dfs -getfacl /data/alice
# file: /data/alice
# owner: alice
# group: analysts
user::rwx
user:bob:r-x
group::r-x
group:developers:rwx
mask::rwx
other::r-x
[alice@hadoop01 ~]$ hdfs dfs -chmod 750 /data/alice
[alice@hadoop01 ~]$ hdfs dfs -getfacl /data/alice
# file: /data/alice
# owner: alice
# group: analysts
user::rwx
group::r-x
group:developers:rwx #effective:r-x
mask::r-x
other::---
[alice@hadoop01 ~]$ hdfs dfs -setfacl -b /data/alice
[alice@hadoop01 ~]$ hdfs dfs -getfacl /data/alice
# file: /data/alice
# owner: alice
# group: analysts
user::rwx
group::r-x
other::---
There are a few points worth highlighting. First, by default, files and
directories do not have any ACLs. After adding an ACL entry to an object, the
HDFS listing now appends a + to the permissions listing, such as in drwxr-xr-x+.
Also, after adding an ACL entry, a new property is listed in the ACL called
mask. The mask defines what the most restrictive permissions will be. For
example, if user bob has rwx permissions, but the mask is r-x, bob’s effective
permissions are r-x and are noted as such in the output of getfacl, as shown
in the example.
Another important part about the mask is that it gets adjusted to the least
restrictive permissions that are set on an ACL. For example, if a mask is
currently set to be r-x and a new ACL entry is added for a group to grant rwx
permissions, the mask is adjusted to rwx.
WARNING
Setting standard POSIX permissions on a file or directory that contains an extended ACL
might immediately impact all entries because hdfs dfs -chmod will effectively set the
mask, regardless of what ACL entries are present. For example, setting 700 permissions on a
file or directory yields effective permissions of no access to all ACL entries defined, except
the owner!
The last part of the example demonstrates how to completely remove all ACL
entries for a directory, leaving just the basic POSIX permissions in place. One
final point about extended ACLs is that they are limited to 32 entries per object
(i.e., file or directory). That being said, four of the entries are taken up by
user, group, other, and mask, so the net is 28 entries, which can be added
before the NameNode throws an error: setfacl: Invalid ACL: ACL has
33 entries, which exceeds maximum of 32.
Another useful feature of extended ACLs is the usage of a default ACL. A
default ACL applies only to a directory, and the effect is that all subdirectories
and files created in that directory inherit the default ACL of the parent
directory. For example, if a directory has a default ACL entry of
default:group:analysts:rwx, then all files created in the directory will get
a group:analysts:rwx entry, and subdirectories will get both the default
ACL and the access ACL copied over. To set a default ACL, simply prepend
default: to the user or group entry in the setfacl command. Remember that
default ACLs do not themselves grant authorization. They simply define the
inheritance behavior of newly created subdirectories and files.
Hadoop also supports authorization at the service level. This can be used to
control which users or groups of users can access certain protocols, as well as
prevent rogue processes from masquerading as daemons. Service-level
authorization is enabled by setting the hadoop.security.authorization
variable to true in core-site.xml. The actual polices are configured in a file
called hadoop-policy.xml. This file is structured similarly to the standard
configuration files where each property is defined in a property tag with one
sub-tag for the name of the property and another for the value of the property.
Each service-level authorization property defines an access control list (ACL)
with a comma-delimited list of users and groups that can access that protocol.
The two lists are separated by a space. A leading space implies an empty list
of users and a trailing space implies an empty list of groups. A special value of
* can be used to signify that all users are granted access to that protocol (this is
the default setting). Example ACLs are provided in Table 6-1.
Table 6-1. Hadoop access control lists
ACL Meaning
"*" All users are permitted
" " No users are permitted
"alice,bob hdusers" alice, bob, and anyone in the hdusers group are permitted
"alice,bob " (trailing space) alice and bob are permitted, but no groups
" hdusers" (leading space) Anyone in the hdusers group is permitted, but no other users
Before we look at the available ACLs, let’s define some users and groups to
help guide the configuration. Assume that we have a small cluster with a
handful of users and a Hadoop administrator. The users of our cluster have
Linux workstations and we want to make sure that they are able to do as much
development from their workstations as possible, so we aren’t planning to put a
firewall between the workstation network and the cluster. Furthermore, assume
that we have a central Active Directory that defines users and groups for the
entire corporate network. The cluster’s KDC is configured with a one-way
trust to allow AD users to log into the cluster without needing new credentials.
Now we want our Hadoop developers to have access to the cluster, but we
don’t want the entire enterprise browsing HDFS or launching MapReduce jobs.
To help in our setup, we’ve configured two groups, one called hadoop-users and one called hadoop-admins. Because this is a new environment, we
initially populate the hadoop-users group with just three users: Alice, Bob, and
Joey. Joey is a certified Hadoop administrator so he’s also added to the
hadoop-admins group.
Service-level authorizations are supported by HDFS, MapReduce (MR1), and
YARN (MR2). The list of protocol ACLs and suggested configuration values
for our example are defined for HDFS, MapReduce (MR1), and YARN (MR2)
in Tables 6-2, 6-3, and 6-4, respectively. Some of the properties are shared
among the services, such as protocols for refreshing the policy configuration,
so they will appear in multiple tables. Because MR1 is not included in Hadoop
2.3, some of the property names are different for the MR1 policies. The MR1
property names are used when deploying Hadoop 1.2 or a distribution that
includes MR1 for use with HDFS from Hadoop 2.x.
Table 6-2. HDFS service-level authorization properties
Property name | Description | Suggested value
security.client.protocol.acl | Client to NameNode protocol; used by user code via the DistributedFileSystem class | "yarn,mapred hadoop-users"
security.client.datanode.protocol.acl | Client to DataNode protocol | "yarn,mapred hadoop-users"
security.get.user.mappings.protocol.acl | Protocol to retrieve the groups that a user maps to | "yarn,mapred hadoop-users"
security.datanode.protocol.acl | DataNode to NameNode protocol | "hdfs"
security.inter.datanode.protocol.acl | DataNode to DataNode protocol | "hdfs"
security.namenode.protocol.acl | SecondaryNameNode to NameNode protocol | "hdfs"
security.qjournal.service.protocol.acl | NameNode to JournalNode protocol | "hdfs"
security.zkfc.protocol.acl | Protocol exposed by the ZKFailoverController | "hdfs"
security.ha.service.protocol.acl | Protocol used by the hdfs haadmin command to manage NameNode high availability | "hdfs,yarn hadoop-admins"
security.refresh.policy.protocol.acl | Used by the hdfs dfsadmin command to load the latest hadoop-policy.xml file | " hadoop-admins"
security.refresh.user.mappings.protocol.acl | Protocol to refresh the user-to-group mappings | " hadoop-admins"
Table 6-3. MapReduce (MR1) service-level authorization properties
Property name | Description | Suggested value
security.task.umbilical.protocol.acl | Protocol used by MR tasks to report task progress. Note: must be set to * | "*"
security.job.submission.protocol.acl | Protocol for clients to submit jobs to the JobTracker | " hadoop-users"
security.inter.tracker.protocol.acl | Protocol used by TaskTrackers to communicate with the JobTracker | "mapred"
security.refresh.policy.protocol.acl | Used by the hadoop mradmin command to load the latest hadoop-policy.xml file | " hadoop-admins"
security.refresh.usertogroups.mappings.protocol.acl | Protocol to refresh the user-to-group mappings. Note: property name changed in Hadoop 2.0 | " hadoop-admins"
security.admin.operations.protocol.acl | Used by the hadoop mradmin command to refresh queues and nodes at the JobTracker | " hadoop-admins"
Table 6-4. YARN and MR2 service-level authorization properties
Property name | Description | Suggested value
security.job.task.protocol.acl | Protocol used by MR tasks to report task progress. Note: must be set to * | "*"
security.containermanagement.protocol.acl | Protocol used by ApplicationMasters to communicate with the NodeManager. Note: must be set to * | "*"
security.applicationmaster.protocol.acl | Protocol used by ApplicationMasters to communicate with the ResourceManager. Note: must be set to * | "*"
security.get.user.mappings.protocol.acl | Protocol to retrieve the groups that a user maps to | "yarn,mapred hadoop-users"
security.applicationclient.protocol.acl | Protocol for clients to submit applications to the ResourceManager | " hadoop-users"
security.job.client.protocol.acl | Protocol used by job clients to communicate with the MapReduce ApplicationMaster | " hadoop-users"
security.mrhs.client.protocol.acl | Protocol used by job clients to communicate with the MapReduce JobHistory server | " hadoop-users"
security.resourcetracker.protocol.acl | ResourceManager to NodeManager protocol | "yarn"
security.resourcemanager-administration.protocol.acl | Protocol used by the yarn rmadmin command to administer the ResourceManager | "yarn"
security.resourcelocalizer.protocol.acl | Protocol used by the resource localizer to communicate with the NodeManager | "testing"
security.ha.service.protocol.acl | Protocol used by the yarn rmadmin command to manage ResourceManager high availability | "hdfs,yarn hadoop-admins"
security.refresh.policy.protocol.acl | Used by the yarn rmadmin command to load the latest hadoop-policy.xml file | " hadoop-admins"
security.refresh.user.mappings.protocol.acl | Protocol to refresh the user-to-group mappings | " hadoop-admins"
You’ll notice that even though we want to keep the cluster fairly locked down,
we had to configure four protocols with permissions to allow any user to
connect. The reason for this is that these protocols are accessed by running
tasks that assume the identity of the application or task attempt. The identity
used will vary with every run and is not related to the username that launched
the job. Because these identities cannot be enumerated in advance, they can’t
be listed in the ACLs or added to a group that could be used to limit access to
those protocols. This is not a major concern, as those interfaces are further
protected by a job token (see “Tokens”) that must be presented in order to gain
access.
Most of the protocols fall into one of two categories: protocols that need to be
accessed by clients, and administration protocols. You can use a more
restrictive value for the client protocols if you want to limit which users can
use Hadoop to a whitelist of users or groups. Note, however, that
security.job.task.protocol.acl (for YARN/MR2) and
security.task.umbilical.protocol.acl (for MR1) must always be set to
*. This is required because the user that uses those protocols is always set to
the job ID of the MapReduce job. The job ID changes per job and is not likely
to appear in any groups provisioned for your cluster. Therefore, any setting
other than * for these properties would cause your jobs to fail. Let’s look at
two user sessions, first with the default settings in hadoop-policy.xml
(Example 6-2) and then again with the suggested values from the tables
(Example 6-3).
Example 6-2. Using the default service-level authorization policies
[alice@hadoop01 ~]$ hdfs dfs -ls .
Found 2 items
drwx------ - alice alice 0 2014-03-29 18:59 .Trash
drwx------ - alice alice 0 2014-03-29 18:59 .staging
[alice@hadoop01 ~]$ hdfs dfs -put file.txt .
[alice@hadoop01 ~]$ hdfs dfs -rm file.txt
14/03/29 21:26:07 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoop02:8020/user/alice/file.txt' to trash at:
hdfs://hadoop02:8020/user/alice/.Trash/Current
[alice@hadoop01 ~]$ hdfs dfs -expunge
14/03/29 21:26:08 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1 minutes, Emptier interval = 0 minutes.
14/03/29 21:26:09 INFO fs.TrashPolicyDefault: Deleted trash checkpoint:
/user/alice/.Trash/140329185911
14/03/29 21:26:09 INFO fs.TrashPolicyDefault: Created trash checkpoint:
/user/alice/.Trash/140329212609
[alice@hadoop01 ~]$ hdfs groups
alice@CLOUDERA : alice production-etl hadoop-users
[alice@hadoop01 ~]$ hdfs dfsadmin -refreshNodes
refreshNodes: Access denied for user alice. Superuser privilege is required
[alice@hadoop01 ~]$ hdfs dfsadmin -refreshServiceAcl
[alice@hadoop01 ~]$ hdfs dfsadmin -refreshUserToGroupsMappings
[alice@hadoop01 ~]$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration
[alice@hadoop01 ~]$ yarn rmadmin -refreshQueues
14/03/29 21:26:16 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
[alice@hadoop01 ~]$ yarn rmadmin -refreshNodes
14/03/29 21:26:18 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
[alice@hadoop01 ~]$ yarn rmadmin -refreshSuperUserGroupsConfiguration
14/03/29 21:26:19 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
[alice@hadoop01 ~]$ yarn rmadmin -refreshUserToGroupsMappings
14/03/29 21:26:21 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
[alice@hadoop01 ~]$ yarn rmadmin -refreshAdminAcls
14/03/29 21:26:22 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
[alice@hadoop01 ~]$ yarn rmadmin -refreshServiceAcl
14/03/29 21:26:23 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
[alice@hadoop01 ~]$ yarn rmadmin -getGroups alice
14/03/29 21:26:25 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
alice : alice production-etl hadoop-users
[alice@hadoop01 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/
hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter random-text
14/03/29 21:26:26 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8032
Running 30 maps.
Job started: Sat Mar 29 21:26:27 EDT 2014
14/03/29 21:26:27 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8032
14/03/29 21:26:27 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN
token 10 for alice on 172.25.2.223:8020
14/03/29 21:26:27 INFO security.TokenCache: Got dt for hdfs://hadoop02:8020;
Kind: HDFS_DELEGATION_TOKEN, Service: 172.25.2.223:8020, Ident:
(HDFS_DELEGATION_TOKEN token 10 for alice)
14/03/29 21:26:28 INFO mapreduce.JobSubmitter: number of splits:30
14/03/29 21:26:28 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1396142628007_0001
14/03/29 21:26:28 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN,
Service: 172.25.2.223:8020, Ident: (HDFS_DELEGATION_TOKEN token 10 for
alice)
14/03/29 21:26:29 INFO impl.YarnClientImpl: Submitted application
application_1396142628007_0001
14/03/29 21:26:29 INFO mapreduce.Job: The url to track the job:
http://hadoop02:8088/proxy/application_1396142628007_0001/
14/03/29 21:26:29 INFO mapreduce.Job: Running job: job_1396142628007_0001
14/03/29 21:26:38 INFO mapreduce.Job: Job job_1396142628007_0001 running
in uber mode : false
14/03/29 21:26:38 INFO mapreduce.Job: map 0% reduce 0%
14/03/29 21:28:37 INFO mapreduce.Job: map 3% reduce 0%
14/03/29 21:28:47 INFO mapreduce.Job: map 7% reduce 0%
14/03/29 21:28:53 INFO mapreduce.Job: map 10% reduce 0%
14/03/29 21:29:09 INFO mapreduce.Job: map 17% reduce 0%
14/03/29 21:29:16 INFO mapreduce.Job: map 23% reduce 0%
14/03/29 21:29:17 INFO mapreduce.Job: map 27% reduce 0%
14/03/29 21:29:18 INFO mapreduce.Job: map 30% reduce 0%
14/03/29 21:29:19 INFO mapreduce.Job: map 33% reduce 0%
14/03/29 21:29:22 INFO mapreduce.Job: map 50% reduce 0%
14/03/29 21:29:23 INFO mapreduce.Job: map 60% reduce 0%
14/03/29 21:29:25 INFO mapreduce.Job: map 70% reduce 0%
14/03/29 21:29:31 INFO mapreduce.Job: map 77% reduce 0%
14/03/29 21:30:05 INFO mapreduce.Job: map 83% reduce 0%
14/03/29 21:30:10 INFO mapreduce.Job: map 90% reduce 0%
14/03/29 21:30:12 INFO mapreduce.Job: map 93% reduce 0%
14/03/29 21:30:14 INFO mapreduce.Job: map 97% reduce 0%
14/03/29 21:30:15 INFO mapreduce.Job: map 100% reduce 0%
14/03/29 21:30:15 INFO mapreduce.Job: Job job_1396142628007_0001
completed successfully
14/03/29 21:30:15 INFO mapreduce.Job: Counters: 29
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=2679890
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=4550
HDFS: Number of bytes written=33067041057
HDFS: Number of read operations=120
HDFS: Number of large read operations=0
HDFS: Number of write operations=60
Job Counters
Launched map tasks=30
Other local map tasks=30
Total time spent by all maps in occupied slots (ms)=4015333
Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
Map input records=30
Map output records=49159093
Input split bytes=4550
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=22565
CPU time spent (ms)=808110
Physical memory (bytes) snapshot=12234526720
Virtual memory (bytes) snapshot=40489713664
Total committed heap usage (bytes)=12699172864
org.apache.hadoop.examples.RandomTextWriter$Counters
BYTES_WRITTEN=32212265105
RECORDS_WRITTEN=49159093
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=33067041057
Job ended: Sat Mar 29 21:30:15 EDT 2014
The job took 227 seconds.
[alice@hadoop01 ~]$ hdfs dfs -rm -r random-text
14/03/29 21:30:17 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoop02:8020/user/alice/random-text' to trash at:
hdfs://hadoop02:8020/user/alice/.Trash/Current
[alice@hadoop01 ~]$ hdfs dfs -expunge
14/03/29 21:30:18 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1 minutes, Emptier interval = 0 minutes.
14/03/29 21:30:19 INFO fs.TrashPolicyDefault: Deleted trash checkpoint:
/user/alice/.Trash/140329212609
14/03/29 21:30:19 INFO fs.TrashPolicyDefault: Created trash checkpoint:
/user/alice/.Trash/140329213019
The listing in Example 6-2 shows Alice using a number of user and
administrative commands. While some commands, such as hdfs dfsadmin -
refreshNodes, require superuser permissions, many don’t require any special
privileges when using the default service-level authorization policies.
Example 6-3 runs through the exact same set of commands using the previously
recommended policies.
Example 6-3. Using the recommended service-level authorization policies
[alice@hadoop01 ~]$ hdfs dfs -ls .
Found 2 items
drwx------ - alice alice 0 2014-03-29 18:52 .Trash
drwx------ - alice alice 0 2014-03-29 18:45 .staging
[alice@hadoop01 ~]$ hdfs dfs -put file.txt .
[alice@hadoop01 ~]$ hdfs dfs -rm file.txt
14/03/29 18:54:11 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoop02:8020/user/alice/file.txt' to trash at:
hdfs://hadoop02:8020/user/alice/.Trash/Current
[alice@hadoop01 ~]$ hdfs dfs -expunge
14/03/29 18:54:13 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1 minutes, Emptier interval = 0 minutes.
14/03/29 18:54:13 INFO fs.TrashPolicyDefault: Deleted trash checkpoint:
/user/alice/.Trash/140329185237
14/03/29 18:54:13 INFO fs.TrashPolicyDefault: Created trash checkpoint:
/user/alice/.Trash/140329185413
[alice@hadoop01 ~]$ hdfs groups
alice@CLOUDERA : alice production-etl hadoop-users
[alice@hadoop01 ~]$ hdfs dfsadmin -refreshNodes
refreshNodes: Access denied for user alice. Superuser privilege is required
[alice@hadoop01 ~]$ hdfs dfsadmin -refreshServiceAcl
refreshServiceAcl: User alice@CLOUDERA (auth:KERBEROS) is not authorized for
protocol interface
org.apache.hadoop.security.authorize.RefreshAuthorizationPolicyProtocol,
expected client Kerberos principal is null
[alice@hadoop01 ~]$ hdfs dfsadmin -refreshUserToGroupsMappings
refreshUserToGroupsMappings: User alice@CLOUDERA (auth:KERBEROS) is not
authorized for protocol interface
org.apache.hadoop.security.RefreshUserMappingsProtocol, expected client
Kerberos principal is null
[alice@hadoop01 ~]$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration
refreshSuperUserGroupsConfiguration: User alice@CLOUDERA (auth:KERBEROS) is
not authorized for protocol interface
org.apache.hadoop.security.RefreshUserMappingsProtocol, expected client
Kerberos principal is null
[alice@hadoop01 ~]$ yarn rmadmin -refreshQueues
14/03/29 18:54:21 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
refreshQueues: User alice@CLOUDERA (auth:KERBEROS) is not authorized
for protocol interface
org.apache.hadoop.yarn.server.api.ResourceManagerAdministrationProtocolPB,
expected client Kerberos principal is null
[alice@hadoop01 ~]$ yarn rmadmin -refreshNodes
14/03/29 18:54:22 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
refreshNodes: User alice@CLOUDERA (auth:KERBEROS) is not authorized
for protocol interface
org.apache.hadoop.yarn.server.api.ResourceManagerAdministrationProtocolPB,
expected client Kerberos principal is null
[alice@hadoop01 ~]$ yarn rmadmin -refreshSuperUserGroupsConfiguration
14/03/29 18:54:24 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
refreshSuperUserGroupsConfiguration: User alice@CLOUDERA (auth:KERBEROS)
is not authorized for protocol interface
org.apache.hadoop.yarn.server.api.ResourceManagerAdministrationProtocolPB,
expected client Kerberos principal is null
[alice@hadoop01 ~]$ yarn rmadmin -refreshUserToGroupsMappings
14/03/29 18:54:25 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
refreshUserToGroupsMappings: User alice@CLOUDERA (auth:KERBEROS)
is not authorized for protocol interface
org.apache.hadoop.yarn.server.api.ResourceManagerAdministrationProtocolPB,
expected client Kerberos principal is null
[alice@hadoop01 ~]$ yarn rmadmin -refreshAdminAcls
14/03/29 18:54:26 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
refreshAdminAcls: User alice@CLOUDERA (auth:KERBEROS)
is not authorized for protocol interface
org.apache.hadoop.yarn.server.api.ResourceManagerAdministrationProtocolPB,
expected client Kerberos principal is null
[alice@hadoop01 ~]$ yarn rmadmin -refreshServiceAcl
14/03/29 18:54:28 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
refreshServiceAcl: User alice@CLOUDERA (auth:KERBEROS)
is not authorized for protocol interface
org.apache.hadoop.yarn.server.api.ResourceManagerAdministrationProtocolPB,
expected client Kerberos principal is null
[alice@hadoop01 ~]$ yarn rmadmin -getGroups alice
14/03/29 18:54:29 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8033
getGroups: User alice@CLOUDERA (auth:KERBEROS)
is not authorized for protocol interface
org.apache.hadoop.yarn.server.api.ResourceManagerAdministrationProtocolPB,
expected client Kerberos principal is null
[alice@hadoop01 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/
hadoop-mapreduce-examples.jar randomtextwriter random-text
14/03/29 18:54:31 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8032
Running 30 maps.
Job started: Sat Mar 29 18:54:32 EDT 2014
14/03/29 18:54:32 INFO client.RMProxy: Connecting to ResourceManager at
hadoop02/172.25.2.223:8032
14/03/29 18:54:32 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN
token 9 for alice on 172.25.2.223:8020
14/03/29 18:54:32 INFO security.TokenCache: Got dt for hdfs://hadoop02:8020;
Kind: HDFS_DELEGATION_TOKEN, Service: 172.25.2.223:8020, Ident:
(HDFS_DELEGATION_TOKEN token 9 for alice)
14/03/29 18:54:32 INFO mapreduce.JobSubmitter: number of splits:30
14/03/29 18:54:32 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1396131817617_0003
14/03/29 18:54:32 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN,
Service: 172.25.2.223:8020, Ident: (HDFS_DELEGATION_TOKEN token 9 for alice)
14/03/29 18:54:33 INFO impl.YarnClientImpl: Submitted application
application_1396131817617_0003
14/03/29 18:54:33 INFO mapreduce.Job: The url to track the job:
http://hadoop02:8088/proxy/application_1396131817617_0003/
14/03/29 18:54:33 INFO mapreduce.Job: Running job: job_1396131817617_0003
14/03/29 18:54:40 INFO mapreduce.Job: Job job_1396131817617_0003
running in uber mode : false
14/03/29 18:54:40 INFO mapreduce.Job: map 0% reduce 0%
14/03/29 18:56:20 INFO mapreduce.Job: map 3% reduce 0%
14/03/29 18:56:53 INFO mapreduce.Job: map 7% reduce 0%
14/03/29 18:56:57 INFO mapreduce.Job: map 10% reduce 0%
14/03/29 18:56:59 INFO mapreduce.Job: map 13% reduce 0%
14/03/29 18:57:02 INFO mapreduce.Job: map 17% reduce 0%
14/03/29 18:57:15 INFO mapreduce.Job: map 20% reduce 0%
14/03/29 18:57:36 INFO mapreduce.Job: map 27% reduce 0%
14/03/29 18:57:44 INFO mapreduce.Job: map 30% reduce 0%
14/03/29 18:57:59 INFO mapreduce.Job: map 33% reduce 0%
14/03/29 18:58:09 INFO mapreduce.Job: map 37% reduce 0%
14/03/29 18:58:19 INFO mapreduce.Job: map 40% reduce 0%
14/03/29 18:58:23 INFO mapreduce.Job: map 43% reduce 0%
14/03/29 18:58:25 INFO mapreduce.Job: map 47% reduce 0%
14/03/29 18:58:35 INFO mapreduce.Job: map 50% reduce 0%
14/03/29 18:58:36 INFO mapreduce.Job: map 53% reduce 0%
14/03/29 18:58:39 INFO mapreduce.Job: map 57% reduce 0%
14/03/29 18:58:40 INFO mapreduce.Job: map 60% reduce 0%
14/03/29 18:58:44 INFO mapreduce.Job: map 63% reduce 0%
14/03/29 18:58:45 INFO mapreduce.Job: map 67% reduce 0%
14/03/29 18:58:47 INFO mapreduce.Job: map 70% reduce 0%
14/03/29 18:58:53 INFO mapreduce.Job: map 73% reduce 0%
14/03/29 18:58:55 INFO mapreduce.Job: map 80% reduce 0%
14/03/29 18:58:57 INFO mapreduce.Job: map 83% reduce 0%
14/03/29 18:59:01 INFO mapreduce.Job: map 90% reduce 0%
14/03/29 18:59:05 INFO mapreduce.Job: map 93% reduce 0%
14/03/29 18:59:07 INFO mapreduce.Job: map 100% reduce 0%
14/03/29 18:59:07 INFO mapreduce.Job: Job job_1396131817617_0003
completed successfully
14/03/29 18:59:07 INFO mapreduce.Job: Counters: 29
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=2679890
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=4550
HDFS: Number of bytes written=33067034387
HDFS: Number of read operations=120
HDFS: Number of large read operations=0
HDFS: Number of write operations=60
Job Counters
Launched map tasks=30
Other local map tasks=30
Total time spent by all maps in occupied slots (ms)=5319195
Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
Map input records=30
Map output records=49157281
Input split bytes=4550
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=13711
CPU time spent (ms)=741910
Physical memory (bytes) snapshot=10065694720
Virtual memory (bytes) snapshot=40491339776
Total committed heap usage (bytes)=14946533376
org.apache.hadoop.examples.RandomTextWriter$Counters
BYTES_WRITTEN=32212267432
RECORDS_WRITTEN=49157281
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=33067034387
Job ended: Sat Mar 29 18:59:07 EDT 2014
The job took 275 seconds.
[alice@hadoop01 ~]$ hdfs dfs -rm -r random-text
14/03/29 18:59:09 INFO fs.TrashPolicyDefault: Namenode trash
configuration: Deletion interval = 1440 minutes, Emptier
interval = 0 minutes.
Moved: 'hdfs://hadoop02:8020/user/alice/random-text' to trash
at: hdfs://hadoop02:8020/user/alice/.Trash/Current
[alice@hadoop01 ~]$ hdfs dfs -expunge
14/03/29 18:59:10 INFO fs.TrashPolicyDefault: Namenode trash
configuration: Deletion interval = 1 minutes, Emptier
interval = 0 minutes.
14/03/29 18:59:11 INFO fs.TrashPolicyDefault: Deleted trash checkpoint:
/user/alice/.Trash/140329185413
14/03/29 18:59:11 INFO fs.TrashPolicyDefault: Created trash checkpoint:
/user/alice/.Trash/140329185911
This time, most of the administrative commands fail with an error of the form
User alice@CLOUDERA (auth:KERBEROS) is not authorized for protocol
interface <protocol>. This indicates that the user is not listed in the ACL
for that protocol and also doesn't belong to a group listed in the ACL.
Service-level authorizations are a very powerful, although complex, tool for
controlling access to a Hadoop cluster. For example, with the policies we
configured, the hdfs user no longer has access to view or modify files in
HDFS unless it is added to the hadoop-users group. This is very useful for
organizations that need to track any administrative action back to the
administrator who performed it. If we combine the recommended service-level
authorizations with setting dfs.permissions.superusergroup to hadoop-admins,
we can tie admin actions back to a specific account. Example 6-4 shows what
happens when the hdfs user attempts to list the files in Alice's home
directory and delete a file that she uploaded.
Example 6-4. User hdfs is denied access to the ClientProtocol
[hdfs@hadoop01 ~]$ hdfs dfs -ls /user/alice/
ls: User hdfs@CLOUDERA (auth:KERBEROS) is not authorized for protocol
interface
org.apache.hadoop.hdfs.protocol.ClientProtocol, expected client Kerberos
principal
is null
[hdfs@hadoop01 ~]$ hdfs dfs -rm /user/alice/file.txt
rm: User hdfs@CLOUDERA (auth:KERBEROS) is not authorized for protocol
interface
org.apache.hadoop.hdfs.protocol.ClientProtocol, expected client Kerberos
principal
is null
Notice that the hdfs user is denied access at the protocol level before any
permission checks can be performed at the HDFS level. Even though hdfs is a
superuser from the filesystem’s perspective, no data can be viewed or modified
due to the service-level check which happens first. Example 6-5 shows what
happens when Joey, a member of the hadoop-admins group, tries to perform
the same actions.
Example 6-5. A member of the hadoop-admins group deleting user files
[joey@hadoop01 ~]$ hdfs dfs -ls /user/alice/
Found 3 items
drwx------   - alice alice          0 2014-03-29 21:30 /user/alice/.Trash
drwx------   - alice alice          0 2014-03-29 21:30 /user/alice/.staging
-rw-------   3 alice alice          5 2014-03-29 21:48 /user/alice/file.txt
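A plausible reconstruction of the rest of the session, assuming Joey simply
deletes the file Alice uploaded (the timestamp is illustrative):

[joey@hadoop01 ~]$ hdfs dfs -rm /user/alice/file.txt
14/03/29 21:49:12 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoop02:8020/user/alice/file.txt' to trash at:
hdfs://hadoop02:8020/user/joey/.Trash/Current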
You’ll notice that this time the actions were allowed. That is because Joey is a
member of both the hadoop-admins group (which is configured as the
superuser group in HDFS) and the hadoop-users group (which gives him
access to the HDFS client protocols).
In addition to configuring ACLs in hadoop-policy.xml, certain HDFS
administrative actions, such as forcing an HA failover, are only available to
HDFS cluster administrators. The administrators are configured by setting
dfs.cluster.administrators in hdfs-site.xml to a comma-delimited list of
users and a comma-delimited list of groups that can administer HDFS. The two
lists are separated by a space. A leading space implies an empty list of users
and a trailing space implies an empty list of groups. A special value of * can
be used to signify that all users have administrative access to HDFS; a value of
" " (without the quotes) signifies that no users have access (this is the default
setting). See Example 6-6 for the recommended setting based on our example
environment.
Example 6-6. The dfs.cluster.administrators setting in hdfs-site.xml
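A sketch of what the recommended setting might look like, assuming the
hadoop-admins group from our example environment (note the leading space in
the value, which denotes an empty list of users):

<property>
  <name>dfs.cluster.administrators</name>
  <value> hadoop-admins</value>
</property>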
The ACL for administration of the MapReduce JobHistory server is not
configured in the hadoop-policy.xml file. Instead, the ACL is configured by
setting mapreduce.jobhistory.admin.acl in mapred-site.xml to a comma-delimited
list of users and a comma-delimited list of groups, identical in format
to those described in "Service-Level Authorization" and depicted in Table 6-1.
A special value of * can be used to signify that all users have administrative
access to the JobHistory server (this is the default setting). See Example 6-7
for the recommended setting.
Example 6-7. The mapreduce.jobhistory.admin.acl setting in mapred-site.xml
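A sketch of the recommended setting, again assuming the hadoop-admins group
(and the leading space denoting an empty user list):

<property>
  <name>mapreduce.jobhistory.admin.acl</name>
  <value> hadoop-admins</value>
</property>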
MapReduce and YARN Authorization
Neither MapReduce nor YARN control access to data, but both provide access
to cluster resources such as CPU, memory, disk I/O, and network I/O. Because
these resources are finite, it is common for administrators to allocate resources
to specific users or groups, especially in multitenant environments. The
service-level authorizations described in the previous section control access to
specific protocols, such as who can and cannot submit a job to the cluster, but
they are not granular enough to control access to cluster resources. Both
MapReduce (MR1) and YARN support job queues as a way of putting limits on
how jobs are allocated resources. In order to securely control those resources,
Hadoop supports access control lists (ACLs) on the job queues. These ACLs
control which users can submit to certain queues as well as which users can
administer a queue. MapReduce defines different classes of users, which affect
the way that ACLs are interpreted:
MapReduce/YARN cluster owner
The user that starts the JobTracker process (MR1) or the ResourceManager
process (YARN) is defined as the cluster owner. That user has permissions
to submit jobs to any queue and can administer any queue or job. In most
cases, the cluster owner is mapred for MapReduce (MR1) and yarn for
YARN. Because it is dangerous to run jobs as the cluster owner, the
LinuxTaskController defaults to blacklisting the mapred and yarn user
accounts so they can’t submit jobs.
MapReduce administrator
There is a setting to create global MapReduce administrators that have the
same privileges as the cluster owner. The advantage to defining specific
users or groups as administrators is that you can still audit the individual
actions of each administrator. This also lets you avoid having to distribute
the password of a shared account, which would otherwise increase the
likelihood that the password could be compromised.
Job owner
The user that submitted a job is its owner. The job owner can always view
and modify their own jobs.
Queue administrator
A user listed in the administer ACL of a queue. Queue administrators can
view and modify any job submitted to that queue.
For MR1, ACLs are administered globally and apply to any job scheduler that
supports ACLs. Both the CapacityScheduler and FairScheduler support ACLs;
the FIFO (default) scheduler does not. Before configuring per-queue ACLs, you
must enable MapReduce ACLs, configure the MapReduce administrators, and
define the queue names in mapred-site.xml:
mapred.acls.enabled
When set to true, ACLs will be checked when submitting or administering
jobs. ACLs are also checked for authorizing the viewing and modification
of jobs in the JobTracker interface.
mapreduce.cluster.administrators
Configure administrators for the MapReduce cluster. Cluster administrators
can always administer any job or queue regardless of the configuration of
job- or queue-specific ACLs. The format for this setting is a comma-
delimited list of users and a comma-delimited list of groups that can access
that protocol. The two lists are separated by a space. A leading space
implies an empty list of users and a trailing space implies an empty list of
groups. A special value of * can be used to signify that all users are granted
access to that protocol (this is the default setting). See Table 6-1 for
examples.
mapred.queue.names
A comma-delimited list of queue names. In order to configure ACLs for a
queue, that queue must be listed in this property. MapReduce always
supports at least one queue named default, so this parameter should always
include default among the list of defined queues.
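As a sketch, the three settings above might look like this in mapred-site.xml,
assuming the hadoop-admins group from our example environment and an
illustrative etl queue alongside default:

<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.cluster.administrators</name>
  <value> hadoop-admins</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>default,etl</value>
</property>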
The configuration for per-queue ACLs is stored in mapred-queue-acls.xml.
There are two types of ACLs that can be configured for each queue, a submit
ACL and an administer ACL:
mapred.queue.<queue_name>.acl-submit-job
The access control list for users that can submit jobs to the queue named
queue_name. The format for the submit job ACL is a comma-delimited list
of users and a comma-delimited list of groups that are allowed to submit to this queue, identical in format to those described in “Service-Level
Authorization” and depicted in Table 6-1. A special value of * can be used
to signify that all users are granted access to that protocol (this is the
default setting). Regardless of the value of this setting, the cluster owner
and MapReduce administrators can submit jobs.
mapred.queue.<queue_name>.acl-administer-jobs
The access control list for users that are allowed to view job details, kill
jobs, or modify a job’s priority for all jobs in the queue named
queue_name. The format for the administer-jobs ACL is a comma-
delimited list of users and a comma-delimited list of groups that are
allowed to administer jobs in this queue, identical in format to those
described in “Service-Level Authorization” and depicted in Table 6-1. A
special value of * can be used to signify that all users are granted access to
that protocol (this is the default setting). Regardless of the value of this
setting, the cluster owner and MapReduce administrators can administer all
the jobs in all the queues. The job owner can also administer jobs.
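As a sketch, per-queue ACLs for the illustrative etl queue might look like
this in mapred-queue-acls.xml, reusing the groups from our example environment
(note the leading spaces, which denote empty user lists):

<property>
  <name>mapred.queue.etl.acl-submit-job</name>
  <value> production-etl</value>
</property>
<property>
  <name>mapred.queue.etl.acl-administer-jobs</name>
  <value> hadoop-admins</value>
</property>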
In addition to the per-queue ACLs, there are two types of ACLs that can be
configured on a per-job basis. Defaults for these settings can be placed in the
mapred-site.xml file used by clients and can be overridden by individual jobs:
mapreduce.job.acl-view-job
The access control list for users that are allowed to view job details. The
format for the view-job ACL is a comma-delimited list of users and a
comma-delimited list of groups that are allowed to view job details,
identical in format to those described in “Service-Level Authorization” and
depicted in Table 6-1. A special value of * can be used to signify that all
users are granted access to that protocol (this is the default setting).
Regardless of the value of this setting, the job owner, the cluster owner,
MapReduce administrators, and administrators of the queue to which the
job was submitted always have access to view a job. This ACL controls
access to job-level counters, task-level counters, a task’s diagnostic
information, task logs displayed on the TaskTracker web UI, and the
job.xml shown by the JobTracker’s web UI.
mapreduce.job.acl-modify-job
The access control list for users that are allowed to kill a job, kill a task,
fail a task, and set the priority of a job. The format for the modify-job ACL
is a comma-delimited list of users and a comma-delimited list of groups
that are allowed to modify the job, identical in format to those described in
“Service-Level Authorization” and depicted in Table 6-1. A special value
of * can be used to signify that all users are granted access to that protocol
(this is the default setting). Regardless of the value of this setting, the job
owner, the cluster owner, MapReduce administrators, and administrators of
the queue to which the job was submitted always have access to modify a
job.
For deployments where you want a default deny policy for access to job
details, a sensible default value for both settings is a single space, “ ” (without
the quotes). This will deny access to job details to all users except the job
owner, queue administrators, cluster administrators, and cluster owner.
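A sketch of that default-deny configuration in the client-side
mapred-site.xml; each value is a single space:

<property>
  <name>mapreduce.job.acl-view-job</name>
  <value> </value>
</property>
<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value> </value>
</property>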
NOTE
In order to control access to job details in the JobTracker web UI, you must configure
MapReduce ACLs as described earlier, as well as enable web UI authentication as described
in Chapter 11.
YARN (MR2)
With YARN/MR2, queue ACLs are no longer defined globally and each
scheduler provides its own method of defining ACLs. ACLs are still enabled
globally and there is a global ACL that defines YARN administrators. The
settings to enable YARN ACLs and to define the admins are configured in the
yarn-site.xml. Example values are provided in Example 6-8.
Example 6-8. YARN ACL configuration in yarn-site.xml
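A sketch of the two settings, assuming hadoop-admins as the YARN
administrators group (note the leading space in the ACL value):

<property>
  <name>yarn.acl.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.admin.acl</name>
  <value> hadoop-admins</value>
</property>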
Because each scheduler is configured differently, we will walk through setting
up queue ACLs one scheduler at a time. For both examples, we will implement
the same use case. Our cluster is primarily used for running production ETL
pipelines, as well as production queries that generate regular reports. There is
some ad hoc reporting as well, but production jobs should always take priority.
In order to control access, we define two additional groups of users that
contain only a subset of the hadoop-users we defined earlier. The production-etl group contains users that run production ETL jobs and the production-queries group contains users that run production queries. For this example,
Alice is a member of the production-etl group while Bob is a member of the
production-queries group. Let’s start by configuring the FairScheduler.
FairScheduler
In order to guarantee the resources needed by the production jobs, we must first
disable the default behavior of the FairScheduler, which is to place each user
into their own queue that matches their username. This is done by setting two
parameters, yarn.scheduler.fair.user-as-default-queue and
yarn.scheduler.fair.allow-undeclared-pools, to false. The first parameter
changes the default queue to default and the second ensures that users can't
submit jobs to queues that have not been predefined. These settings, as well
as the setting to enable the FairScheduler, are found in Example 6-9.
Example 6-9. FairScheduler configuration in yarn-site.xml
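A sketch of the settings just described; the scheduler class is the standard
FairScheduler implementation:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.allow-undeclared-pools</name>
  <value>false</value>
</property>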
Next, we must define the queues and their ACLs within the fair-scheduler.xml
file. The FairScheduler uses a hierarchical queue system and each queue is a
descendant of the root queue. In our example, we want to provide 90% of the
cluster resources to production jobs and 10% to ad hoc jobs. To achieve this,
we define two direct children of the root queue: prod for production jobs and
default for ad hoc jobs. We use the name "default" for the ad hoc queue
because that is the queue jobs are submitted to if a queue is not specified.
Resource management is a complex topic and we could tweak a lot of different
settings to control the resources just so. Because our focus is on security, we'll
use a simplified scheme and just control the resources with the weight of the
queues. All that you need to understand is that for all queues that share a
common parent, their resource allocation is defined as their weight divided by
the sum of the weights of all queues under that parent. In this case, we can
assign prod a weight of 9.0 and default a weight of 1.0 to get the desired
90/10 split.
We also want to break up the production queue into two subqueues: one for
ETL jobs and one for queries. For this example, we'll leave the two queues
equally weighted by setting both queues to a weight of 1.0. It is important to
note that the calculation of fair share happens in the context of your parent
queue. In this example, that means that because we're giving both the etl and
queries queues 50% of the resources of the prod queue, they'll end up with a
global fair share of 45% each (50% × 90% = 45%).
Just as resources are inherited, so too are ACLs. With the FairScheduler, any
user that has permission to submit jobs to a queue also has permission to
submit jobs to any descendant queues. The same applies to users with
administrative privileges to a queue. In keeping with earlier examples, we
want any member of the hadoop-admins group to be able to administer any
job/queue, so we add them to the aclAdministerApps ACL of the root queue.
It's also worth noting that you must set the root queue's aclSubmitApps
ACL to " " (without the quotes), otherwise any user could submit to any queue,
as the default ACL when one is not defined is to allow all. For the default
queue, we want to allow any member of the hadoop-users group permission to
submit jobs, so we set aclSubmitApps to " hadoop-users" (without the quotes,
and note the leading space). The etl and queries queues have aclSubmitApps
set to " production-etl" and " production-queries" (without the quotes),
respectively, as these are the groups we defined earlier. The complete
configuration for the FairScheduler is shown in Example 6-10.
Example 6-10. fair-scheduler.xml
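A sketch of an allocations file implementing the queues, weights, and ACLs
described above:

<?xml version="1.0"?>
<allocations>
  <queue name="root">
    <aclSubmitApps> </aclSubmitApps>
    <aclAdministerApps> hadoop-admins</aclAdministerApps>
    <queue name="prod">
      <weight>9.0</weight>
      <queue name="etl">
        <weight>1.0</weight>
        <aclSubmitApps> production-etl</aclSubmitApps>
      </queue>
      <queue name="queries">
        <weight>1.0</weight>
        <aclSubmitApps> production-queries</aclSubmitApps>
      </queue>
    </queue>
    <queue name="default">
      <weight>1.0</weight>
      <aclSubmitApps> hadoop-users</aclSubmitApps>
    </queue>
  </queue>
</allocations>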
Now let's see what happens when Bob tries to kill one of Alice's jobs without
having queue ACLs defined. First, Bob gets a list of running jobs to find the
job ID for Alice's job. Then he requests that the job be killed. Because there
are no controls over who is and isn't allowed to administer jobs, YARN will
happily oblige his request. See Example 6-11 for the complete listing of Bob's
user session.
Example 6-11. Killing another user’s job when no ACLs are defined
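A plausible reconstruction of the session, assuming Bob uses the same mapred
job commands that Joey uses in Example 6-13; the start time is illustrative
and the job ID corresponds to the application ID shown in Example 6-12:

[bob@hadoop01 ~]$ mapred job -list
Total jobs:1
JobId State StartTime UserName
job_1396192703139_0001 RUNNING 1396192744512 alice
[bob@hadoop01 ~]$ mapred job -kill job_1396192703139_0001
Killed job job_1396192703139_0001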
This is less than ideal, as users can interfere with one another’s production
jobs. More importantly, a simple copy/paste error could result in a user
accidentally killing another user’s job. If we try the same exact process after
configuring ACLs in the FairScheduler, we instead get the result shown in
Example 6-12.
Example 6-12. Bob is denied administrative permissions by the queue ACLs
[bob@hadoop01 ~]$ mapred job -kill job_1396192703139_0001
org.apache.hadoop.yarn.exceptions.YarnException:
java.security.AccessControlException: User bob cannot perform operation
MODIFY_APP on application_1396192703139_0001
        at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
...
[bob@hadoop01 ~]$
There are almost always times when some admin must be able to kill another
user's jobs, which is why we configured admin access to the hadoop-admins
group on the root queue. So if Joey, one of the Hadoop administrators, attempts
to kill a job, it will proceed as shown in Example 6-13.
Example 6-13. Successfully killing a MapReduce job
[joey@hadoop01 ~]$ mapred job -list
Total jobs:1
JobId State StartTime UserName
job_1396192703139_0002 RUNNING 1396193202565 alice
[joey@hadoop01 ~]$ mapred job -kill job_1396192703139_0002
Killed job job_1396192703139_0002
Controlling administrative access is obviously useful, but it’s also helpful to
prevent users from submitting jobs to the wrong queue. In our example, Alice
has permission to submit jobs to the prod.etl queue because she is a member
of the production-etl group. However, she is not a member of the production-
queries group, so if she tries to submit a job there, she will be denied, as
shown in Example 6-14.
Example 6-14. Alice is not allowed to submit jobs to the prod.queries queue
[alice@hadoop01 ~]$ yarn jar \
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
\
randomtextwriter -Dmapreduce.job.queuename=prod.queries random-text
...
Job started: Sun Mar 30 13:20:57 EDT 2014
14/03/30 13:20:59 ERROR security.UserGroupInformation:
PriviledgedActionException
as:alice@CLOUDERA (auth:KERBEROS) cause:java.io.IOException: Failed to run job
:
User alice cannot submit applications to queue root.prod.queries
...
CapacityScheduler
The CapacityScheduler supports hierarchical queues just like the
FairScheduler. It also supports the same per-queue ACLs and the same ACL
inheritance policy of the FairScheduler. In fact, from a security perspective, the
two schedulers are identical and only differ in the format of their configuration
files. In order to implement the same polices described earlier, you must first
enable the CapacityScheduler in the yarn-site.xml file, as shown in Example 6-
15.
Example 6-15. CapacityScheduler configuration in yarn-site.xml
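A sketch of the setting that enables the CapacityScheduler, using the standard
yarn.resourcemanager.scheduler.class property:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>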
Once enabled, the CapacityScheduler reads its configuration from a file called
capacity-scheduler.xml. A sample configuration that implements the same
queues and ACLs is shown in Example 6-16. For the FairScheduler, the ACLs
are configured as child elements of the queue definition using the
aclSubmitApps tag to control who can submit applications to a queue and the
aclAdministerApps tag to control who can administer the jobs in a queue.
The equivalent settings for the CapacityScheduler are the
yarn.scheduler.capacity.<queue-path>.acl_submit_applications and
yarn.scheduler.capacity.<queue-path>.acl_administer_queue properties,
respectively, with <queue-path> replaced with a queue's hierarchy. For
example, the name of the property that defines the prod.etl queue's submit
ACL is yarn.scheduler.capacity.root.prod.etl.acl_submit_applications,
as shown in Example 6-16.
Example 6-16. capacity-scheduler.xml
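A sketch of a capacity-scheduler.xml implementing the same queue hierarchy
and ACLs; the capacities are chosen to mirror the 90/10 and 50/50 splits from
the FairScheduler example:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.acl_submit_applications</name>
  <value> </value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.acl_administer_queue</name>
  <value> hadoop-admins</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.queues</name>
  <value>etl,queries</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>90</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.etl.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.etl.acl_submit_applications</name>
  <value> production-etl</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.queries.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.queries.acl_submit_applications</name>
  <value> production-queries</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>10</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
  <value> hadoop-users</value>
</property>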
ZooKeeper ACLs
Apache ZooKeeper controls access to ZNodes (paths) through the use of access
control lists (ACLs). ZooKeeper’s ACLs are similar to POSIX permission bits,
but are more flexible because permissions are set on a per-user basis rather
than based on owner and primary group. In fact, ZooKeeper doesn’t have the
notion of owners or groups. As described in “Username and Password
Authentication”, users are specified by an authentication scheme and a scheme-
specific ID. The format for the IDs varies by the scheme.
An individual ACL has a scheme, ID, and the permissions. The list of available
permissions is shown in Table 6-5. It’s important to note that in ZooKeeper,
permissions are not recursive; they apply only to the ZNode that they are
attached to, not to any of its children. Because ZooKeeper doesn't have the
notion of owners for ZNodes, a user must have the ADMIN permission on a
ZNode to be able to set its ACLs.
Table 6-5. ZooKeeper ACL permissions
Permission Description
CREATE Permission to create a child ZNode
READ Permission to get data from a ZNode and to list its children
WRITE Permission to set the data for a ZNode
DELETE Permission to delete children ZNodes
ADMIN Permission to set ACLs
The CREATE and DELETE permissions are used to control who can create and
delete children of a ZNode. The use case that motivates granting CREATE but
not DELETE is when you want a path in which users can create children but
only an administrator can delete children.
If you're adding ACLs using the Java API, you'll first create an Id object with
the scheme and ID, and then create an ACL object with the Id and the
permissions as an integer. You can manually calculate a permission value or
use the constants in the ZooDefs.Perms class to get the combined permission
integer for the permissions you want to set. See Example 6-17 for sample Java
code for setting the ACL on a path.
Example 6-17. Setting ZooKeeper ACLs with the Java API
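A minimal sketch using the ZooKeeper Java API; the connection string, path,
and principal are illustrative:

import java.util.Collections;
import org.apache.zookeeper.ZooDefs.Perms;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;

public class SetAclExample {
  public static void main(String[] args) throws Exception {
    // Connect to ZooKeeper (no watcher needed for this example)
    ZooKeeper zk = new ZooKeeper("zk01.example.com:2181", 30000, null);

    // Identify the user with the sasl scheme and a Kerberos principal
    Id id = new Id("sasl", "alice@EXAMPLE.COM");

    // Combine permissions using the constants from ZooDefs.Perms
    int perms = Perms.READ | Perms.WRITE;

    // Attach the permissions to the identity and set the ACL on the path;
    // the -1 version argument skips the ACL version check
    ACL acl = new ACL(perms, id);
    zk.setACL("/examples/secure-znode", Collections.singletonList(acl), -1);
    zk.close();
  }
}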
The digest scheme was described in "Username and Password
Authentication", but ZooKeeper supports a number of other built-in schemes.
Table 6-6 describes the available schemes and the format of an ID when used
in an ACL. For the world scheme, the only ID is the literal string anyone. The
digest scheme uses the base64 encoding of the SHA-1 digest of the
<username>:<password> string. The ip scheme lets you set ACLs based on
an IP address or a range using CIDR notation. Finally, the sasl scheme uses
the <principal> as the ID. By default, the principal is the full UPN of the user.
You can control how the principal is canonicalized by setting
kerberos.removeRealmFromPrincipal and/or
kerberos.removeHostFromPrincipal to remove the realm and host
components, respectively, before comparing the IDs.
Table 6-6. ZooKeeper ACL schemes
Scheme | Description                                                  | ACL ID format
world  | Represents any user                                          | anyone
digest | Represents a user that is authenticated with a password      | <username>:base64(sha1sum(<username>:<password>))
ip     | Uses the client IP address as an identity                    | <ip>[/<cidr>]
sasl   | Represents a SASL authenticated user (e.g., a Kerberos user) | <principal>
Oozie Authorization
Apache Oozie has a very simple authorization model with two levels of
accounts: users and admin users. Users have the following permissions:
Read access to all jobs
Write access to their own jobs
Write access to jobs based on a per-job access control list (list of users and
groups)
Read access to admin operations
Admin users have the following permissions:
Write access to all jobs
Write access to admin operations
You can enable Oozie authorization by setting the following parameters in the
oozie-site.xml file:
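A sketch of the relevant oozie-site.xml settings, assuming the hadoop-admins
group holds the Oozie administrators:

<property>
  <name>oozie.service.AuthorizationService.security.enabled</name>
  <value>true</value>
</property>
<property>
  <name>oozie.service.AuthorizationService.admin.groups</name>
  <value>hadoop-admins</value>
</property>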
If you don't set the oozie.service.AuthorizationService.admin.groups
parameter, then you can specify a list of admin users, one per line, in the
adminusers.txt file:
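For example, an adminusers.txt naming Joey as the only admin user:

joey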
In addition to owners and admin users having write access to a job, users can
be granted write privileges through the use of a job-specific access control list.
An Oozie ACL uses the same syntax as Hadoop ACLs (see Table 6-1) and is
set in the oozie.job.acl property of a workflow, coordinator, or bundle
job.properties file when submitting a job.
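For example, a job.properties entry granting write access to Bob and to the
production-queries group (both names illustrative):

oozie.job.acl=bob production-queries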
HBase and Accumulo Authorization
Apache HBase and Apache Accumulo are sorted, distributed key/value stores
based on the design of Google’s BigTable and built on top of HDFS and
ZooKeeper. Both systems share a similar data model and are designed to
enable random access and update workloads on top of HDFS which is a write-
once filesystem. Data is stored in rows that contain one or more columns.
Unlike a relational database, the columns in each row can differ. This makes it
easier to implement complex data models where not every record shares the
same schema. Each row is indexed with a primary key called a row id or row; and within a row, each value is further indexed by a column key and
timestamp. The intersection of a row key, column key and timestamp, along
with the value they point to, is often called a cell. Internally, HBase and
Accumulo store data as a sorted sequence of key/value pairs with the key
consisting of the row ID, column key, and timestamp. Column keys are further
split into two components; a column family and a column qualifier. In HBase,
all of the columns in the same column family are stored in separate files on disk
whereas in Accumulo multiple column families can be grouped together into
locality groups.
A collection of sorted rows is called a table. In HBase, the set of column
families is predefined per table while Accumulo lets users create new column
families on the fly. In both systems, column qualifiers do not need to be
predefined and arbitrary qualifiers can be inserted into any row. A logical
grouping of tables, similar to a database or schema in a relational database
system, is called a namespace. Both HBase and Accumulo support permissions
at the system, namespace, and table level. The available permissions and their
semantics differ between Accumulo and HBase, so let’s start by taking a look
at Accumulo’s permission model.
System, Namespace, and Table-Level Authorization
At the highest level, Accumulo supports system permissions. Generally, system
permissions are reserved for the Accumulo root user or Accumulo
administrators. Permissions set at a higher level are inherited by objects at a
lower level. For example, if you have the system permission CREATE_TABLE,
you can create a table in any namespace even if you don’t have explicit
permissions to create tables in that namespace. See Table 6-7 for a list of
system-level permissions, their descriptions, and the equivalent namespace-
level permission.
WARNING
Throughout this section, you’ll see many references to the Accumulo root user. This is not the
same as the root system account. The Accumulo root user is automatically created when
Accumulo is initialized, and that user is granted all of the system-level permissions. The root
user can never have these permissions revoked, which prevents leaving Accumulo in a state
where no one can administer it.
Table 6-7. System-level permissions in Accumulo
Permission              | Description                                                | Equivalent namespace permission
System.GRANT            | Permission to grant permissions                            | Namespace.ALTER_NAMESPACE
System.CREATE_TABLE     | Permission to create tables                                | Namespace.CREATE_TABLE
System.DROP_TABLE       | Permission to delete tables                                | Namespace.DROP_TABLE
System.ALTER_TABLE      | Permission to modify tables                                | Namespace.ALTER_TABLE
System.DROP_NAMESPACE   | Permission to drop namespaces                              | Namespace.DROP_NAMESPACE
System.ALTER_NAMESPACE  | Permission to modify namespaces                            | Namespace.ALTER_NAMESPACE
System.CREATE_USER      | Permission to create new users                             | N/A
System.DROP_USER        | Permission to delete users                                 | N/A
System.ALTER_USER       | Permission to change user passwords and settings           | N/A
System.SYSTEM           | Permission to perform system-level administrative actions  | N/A
System.CREATE_NAMESPACE | Permission to create new namespaces                        | N/A
Namespaces are a logical collection of tables and are useful for organizing
tables and delegating administrative functions to smaller groups. Suppose the
marketing department needs to host a number of Accumulo tables to power
some of its applications. In order to reduce the burden on the Accumulo
administrator, we can create a marketing namespace and give GRANT,
CREATE_TABLE, DROP_TABLE, and ALTER_TABLE permissions to an
administrator in marketing. This will allow the department to create and
manage its own tables without having to grant system-level permissions or wait
for the Accumulo administrator. A number of namespace-level permissions are
inherited by tables in the namespace. See Table 6-8 for the list of namespace-
level permissions, their descriptions, and the equivalent table-level
permission.
Table 6-8. Namespace-level permissions in Accumulo
Permission                | Description                                                  | Equivalent table permission
Namespace.READ            | Permission to read (scan) tables in the namespace            | Table.READ
Namespace.WRITE           | Permission to write (put/delete) to tables in the namespace  | Table.WRITE
Namespace.GRANT           | Permission to grant permissions to tables in the namespace   | Table.GRANT
Namespace.BULK_IMPORT     | Permission to bulk import data into tables in the namespace  | Table.BULK_IMPORT
Namespace.ALTER_TABLE     | Permission to set properties on tables in the namespace      | Table.ALTER_TABLE
Namespace.DROP_TABLE      | Permission to delete tables in the namespace                 | Table.DROP_TABLE
Namespace.CREATE_TABLE    | Permission to create tables in the namespace                 | N/A
Namespace.ALTER_NAMESPACE | Permission to set properties on the namespace                | N/A
Namespace.DROP_NAMESPACE  | Permission to delete the namespace                           | N/A
Table-level permissions are used to control coarse-grained access to
individual tables. Table 6-9 contains a list of table-level permissions and their
descriptions.
Table 6-9. Table-level permissions in Accumulo
Permission Description
Table.READ Permission to read (scan) the table
Table.WRITE Permission to write (put/delete) to the table
Table.BULK_IMPORT Permission to bulk import data into the table
Table.ALTER_TABLE Permission to set properties on the table
Table.GRANT Permission to grant permissions to the table
Table.DROP_TABLE Permission to delete the table
System, namespace, and table-level permissions can be managed using the
Accumulo shell. In particular, permissions are granted using the grant
command and can be revoked using the revoke command. See Example 6-18
for an example of using the Accumulo shell to administer permissions.
Example 6-18. Administering permissions using the Accumulo shell
root@cloudcat> userpermissions -u alice
System permissions:
Namespace permissions (accumulo): Namespace.READ
Table permissions (accumulo.metadata): Table.READ
Table permissions (accumulo.root): Table.READ
root@cloudcat> user alice
Enter password for user alice: *****
alice@cloudcat> table super_secret_squirrel
alice@cloudcat super_secret_squirrel> scan
2014-03-31 16:11:06,828 [shell.Shell] ERROR: java.lang.RuntimeException:
org.apache.accumulo.core.client.AccumuloSecurityException: Error
PERMISSION_DENIED
for user alice on table super_secret_squirrel(ID:a) - User does not have
permission
to perform this action
alice@cloudcat super_secret_squirrel> user root
Enter password for user root: ******
root@cloudcat super_secret_squirrel> grant Namespace.READ -ns "" -u alice
root@cloudcat super_secret_squirrel> user alice
Enter password for user alice: *****
alice@cloudcat super_secret_squirrel> scan
r f:c [] value
alice@cloudcat super_secret_squirrel>
HBase uses the same set of permissions (Table 6-10) for ACLs at the system,
namespace, and table level. Permissions granted at a higher level are inherited
by objects at the lower level. For example, if you grant system-level READ
permissions to a user, that user can read all tables in the cluster. HBase
supports assigning permissions to groups as well as individual users. Group
permissions are assigned by prefixing the group name with an @ when using the
grant shell command. HBase uses the same user-to-group mapping classes that
come with Hadoop. Group mapping defaults to loading the Linux groups on the
HBase Master and supports using LDAP groups or a custom mapping.
Table 6-10. Permissions in HBase
Permission | Description
READ (R)   | Permission to read (get/scan) data
WRITE (W)  | Permission to write (put/delete)
EXEC (X)   | Permission to execute coprocessor endpoints
CREATE (C) | Permission to drop the table; alter table attributes; and add, alter, or drop column families
ADMIN (A)  | Permission to enable and disable the table, trigger region reassignment or relocation, and the permissions granted by CREATE
Example 6-19 takes a look at using system-level permissions to grant read
access to all tables. First, Alice brings up the HBase shell, gets a list of tables,
and attempts to scan the super_secret_squirrel table.
Example 6-19. Alice is denied access to an HBase table
[alice@cdh5-hbase ~]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.0, rUnknown, Fri Feb 7 12:26:17 PST 2014
hbase(main):001:0> list
TABLE
super_secret_squirrel
1 row(s) in 2.2110 seconds
hbase(main):002:0> scan 'super_secret_squirrel'
ROW COLUMN+CELL
ERROR: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient
permissions for user 'alice' for scanner open on table super_secret_squirrel
hbase(main):003:0> user_permission
User
Table,Family,Qualifier:Permission
0 row(s) in 0.7350 seconds
Notice that when Alice executes the user_permission command, she is
nowhere to be found. Alice asks the HBase administrator to grant her access to
all the tables in HBase. The admin logs into the HBase shell as the hbase user
and uses the grant command to give Alice READ permissions at the system
level.
Example 6-20. HBase admin grants Alice system-level READ permissions
[hbase@cdh5-hbase ~]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.96.1.1-cdh5.0.0-beta-2, rUnknown, Fri Feb 7 12:26:17 PST 2014
hbase(main):001:0> grant 'alice', 'R'
row(s) in 2.6990 seconds
hbase(main):002:0> user_permission 'super_secret_squirrel'
User
Table,Family,Qualifier:Permission
hbase super_secret_squirrel,,:
[Permission:
actions=READ,WRITE,EXEC,CREATE,ADMIN]
row(s) in 0.2140 seconds
Notice that Alice still doesn’t have permissions specific to the
super_secret_squirrel table as she was granted access at the system level.
Permissions at the system level are displayed in the shell as applying to the
hbase:acl table, as shown in Example 6-21. Now when Alice executes a
scan, she gets back the rows from the table.
Example 6-21. Alice can now scan any HBase table
hbase(main):004:0> user_permission
User
Table,Family,Qualifier:Permission
alice hbase:acl,,: [Permission:
actions=REA
D]
1 row(s) in 0.1540 seconds
hbase(main):005:0> scan 'super_secret_squirrel'
ROW COLUMN+CELL
r column=f:q,
timestamp=1396369612376,
value=value
1 row(s) in 0.1310 seconds
Column- and Cell-Level Authorization
HBase and Accumulo also support fine-grained authorization at the data level.
In HBase, you can specify permissions down to the column level. Only the
READ and WRITE permissions are applicable to column-level ACLs. Because
HBase supports assigning permissions to groups, this is a form of role-based
access control.
In Accumulo, security labels are stored at the cell level and each key/value
pair has its own label. Accumulo stores the security labels as part of the key by
extending the BigTable data model with a visibility element between the
column qualifier and timestamp. Like all of the elements of Accumulo’s keys,
security labels do not need to be predefined and can be created when data is
inserted. In order to support more complex combinations of permissions,
security labels consist of a set of user-defined tokens that are combined using
the boolean | and & operators. Parentheses can also be used to specify
precedence of the boolean operators.
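To make the label mechanics concrete, here is a minimal sketch using
Accumulo's Java client API; the row, column, and token names are illustrative,
and the mutation would still need to be written with a BatchWriter:

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;

public class VisibilityExample {
  public static Mutation labeledCell() {
    // Only users whose authorizations satisfy the expression can read the
    // cell: either both "marketing" and "analyst", or "admin"
    ColumnVisibility vis = new ColumnVisibility("(marketing&analyst)|admin");
    Mutation m = new Mutation("row1");
    m.put("family", "qualifier", vis, "value");
    return m;
  }
}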
Summary
In this chapter, we covered authorization for permitting or denying access to
data and services in the cluster. Setting permissions and ACLs to control
access to data and resources is fundamental in Hadoop administration. We saw
that authorization controls look a bit different from component to component,
especially the differences between those that authorize access to data (HDFS,
HBase, Accumulo) and those that authorize access to processing and resources
(MapReduce, YARN).
Chapter 7. Apache Sentry (Incubating)
Over the lifetime of the various Hadoop ecosystem projects, secure authorization has been added in a
variety of different ways. It has become increasingly challenging for administrators to implement and
maintain a common system of authorization across multiple components. To compound the problem, the
various components have different levels of granularity and enforcement of authorization controls, which
often leave an administrator confused as to what a given user can actually do (or not do) in the Hadoop
environment. These issues, and many others, were the driving force behind the proposal for Apache Sentry
(Incubating).
The Sentry proposal identified a need for fine-grained role-based access controls (RBAC) to give
administrators more flexibility to control what users can access. Traditionally, and covered already, HDFS
authorization controls are limited to simple POSIX-style permissions and extended ACLs. What about
frameworks that work on top of HDFS, such as Hive, Cloudera Impala, Solr, HBase, and others? Sentry’s
goals are to implement authorization for Hadoop ecosystem components in a unified way so that security
administrators can easily control what users and groups have access to without needing to know the ins and
outs of every single component in the Hadoop stack.
Each component that leverages Sentry for authorization must have a Sentry binding. The binding is a plug-in
that the component uses to delegate authorization decisions to Sentry. This binding applies the relevant
model to use for authorization decisions. For example, a SQL model would apply for the components Hive
and Impala, a Search model would apply to Solr, and a BigTable model would apply to HBase and
Accumulo. Sentry privilege models are discussed in detail a bit later.
With the appropriate model in place, Sentry uses a policy engine to determine if the requested action is
authorized by checking the policy provider. The policy provider is the storage mechanism for the policies,
such as a database or text file. Figure 7-1 shows how this looks conceptually.

This flow makes sense for how components leverage Sentry at a high level, but what about the actual
decision-making process for authorization by the policy engine? Regardless of the model in place for a
given component, there are several key concepts that are common. Users are what you expect them to be.
They are identities performing a specific action, such as executing a SQL query, searching a collection,
reading a file, or retrieving a key/value pair. Users also belong to groups. In the Sentry context, groups are
a collection of users that have the same needs and privileges. A privilege in Sentry is a unit of data access
and is represented by a tuple of an object and an action to be performed on the object. For example, an
object could be a DATABASE, TABLE, or COLLECTION, and the action could be CREATE, READ, or WRITE.
Important
Sentry privileges are always defined in the positive case because, by default, Sentry denies access to every
object. This is not to be confused with REVOKE syntax covered later, which simply removes the positive
case privileges.
Lastly, a role is a collection of privileges and is the basic unit of grant within Sentry. A role typically aligns
with a business function, such as a marketing analyst or database administrator. The relationship between
users, groups, privileges, and roles is important in Sentry, and adheres to the following logic:
A group contains multiple users
A role is assigned to a group
A role is granted privileges
This is illustrated in Figure 7-2.
Figure 7-2. Sentry entity relationships

This relationship is strictly enforced in Sentry. It is not possible to assign a role to a user or grant privileges
to a group, for example. While this relationship is strict, there are several many-to-many relationships in
play here. A user can belong to many groups and a group can contain many users. For example, Alice could
belong to both the Marketing and Developer groups, and the Developer group could contain both Alice and
Bob.
Also, a role can be assigned to many groups and a group can have many roles. For example, the SQL
Analyst role could be assigned to both the Marketing and Developer groups, and the Developer group could
have both the SQL Analyst role and Database Administrator role.
Lastly, a role can be granted many privileges and a given privilege can be a part of many roles. For
example, the SQL Analyst role could have SELECT privileges on the clickstream TABLE and CREATE
privileges on the marketing DATABASE, and the same CREATE privilege on the marketing DATABASE could
also be granted to the Database Administrator role.
In its early days, Sentry policies were configured in a plain-text file that enumerated every policy. Whenever a
policy was added, modified, or removed, it required a modification to the file. As you might imagine, this
approach is rather simplistic, cumbersome to maintain, and prone to errors. To compound the problem,
mistakes made in the policy file invalidated the entire file!
Thankfully, Sentry has largely moved beyond this early beginning and has grown into a first-class citizen in
the Hadoop ecosystem. Starting with version 1.4, Sentry comes with a service that can be leveraged by
Hive and Impala. This service utilizes a database backend instead of a text file for policy storage.
Additionally, services that use Sentry are now configured with a binding that points to the Sentry service
instead of a binding to handle all of the authorization decisions locally. Because of advancements in
Sentry’s architecture, it is not recommended to use the policy file–based configuration for Hive and Impala
except on legacy systems. That being said, this chapter will include information about both configuration
options. Figure 7-3 depicts how the Sentry service fits in with SQL access.
Figure 7-3. Sentry service architecture

At the time of this writing, Solr still utilizes policy files. It is expected that Solr as well as any other new
Sentry-enabled services will move away from using policy file–based configurations.
Sentry Service Configuration
The first part of getting Sentry up and running in the cluster is to configure the Sentry service. The master
configuration file for Sentry is called sentry-site.xml. Example 7-1 shows a typical configuration for the
Sentry server in a Kerberos-enabled cluster, and Table 7-1 explains the configuration parameters. Later on
in the chapter, we will take a look at how Hadoop ecosystem components utilize this Sentry service for
authorization.
Example 7-1. Sentry service sentry-site.xml
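The original listing is not reproduced here. The following sketch is assembled from the parameters in
Table 7-1 below; the hostname, principal, keytab, port, and JDBC values are placeholders rather than
values from the original example.
<configuration>
  <property>
    <name>sentry.service.security.mode</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>sentry.service.server.principal</name>
    <value>sentry/_HOST@EXAMPLE.COM</value>
  </property>
  <property>
    <name>sentry.service.server.keytab</name>
    <value>sentry.keytab</value>
  </property>
  <property>
    <name>sentry.service.server.rpc-address</name>
    <value>server1.example.com</value>
  </property>
  <property>
    <name>sentry.service.server.rpc-port</name>
    <value>8038</value>
  </property>
  <property>
    <name>sentry.service.admin.group</name>
    <value>sentryadmin</value>
  </property>
  <property>
    <name>sentry.service.allow.connect</name>
    <value>hive,impala</value>
  </property>
  <property>
    <name>sentry.store.jdbc.url</name>
    <value>jdbc:mysql://server1.example.com:3306/sentry</value>
  </property>
  <property>
    <name>sentry.store.jdbc.user</name>
    <value>sentry</value>
  </property>
  <property>
    <name>sentry.store.jdbc.password</name>
    <value>sentry_password</value>
  </property>
</configuration>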
Table 7-1 shows all of the relevant configuration parameters for sentry-site.xml. This includes parameters
that are used for configuring the Sentry service, as well as configurations for policy file–based
implementations and component-specific configurations.
Table 7-1. sentry-site.xml configurations
Configuration Description
hive.sentry.provider The class used to determine group mappings; typically
org.apache.sentry.provider.common.HadoopGroupResourceAuthorizationProvider
for Hadoop groups; local groups can be defined only in a policy-file deployment and use
org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider
hive.sentry.provider.resource The location of the policy file; can be both file:// and hdfs:// URIs
hive.sentry.server The name of the Sentry server; can be anything
sentry.hive.provider.backend Type of Sentry deployment: org.apache.sentry.provider.db.SimpleDBProviderBackend (Sentry
service) or org.apache.sentry.provider.file.SimpleFileProviderBackend (policy file)
sentry.metastore.service.users List of users allowed to bypass Sentry policies for the Hive metastore; only applies to Sentry
service deployments
sentry.provider Same options as hive.sentry.provider; used by Solr
sentry.service.admin.group List of comma-separated groups that are administrators of the Sentry server
sentry.service.allow.connect List of comma-separated users that are allowed to connect; typically only service users, not end users
sentry.service.client.server.rpc-address Client configuration of the Sentry service endpoint
sentry.service.client.server.rpc-port Client configuration of the Sentry service port
sentry.service.security.mode The security mode the Sentry server is operating under; kerberos or none
sentry.service.server.keytab Keytab filename that contains the credentials for sentry.service.server.principal
sentry.service.server.principal Service principal name contained in sentry.service.server.keytab that the Sentry server
identifies itself as
sentry.service.server.rpc-address The hostname to start the Sentry server on
sentry.service.server.rpc-port The port to listen on
sentry.solr.provider.resource The location of the policy file for Solr; can be both file:// and hdfs:// URIs
sentry.store.jdbc.driver The JDBC driver name to use to connect to the database
sentry.store.jdbc.password The JDBC password to use
sentry.store.jdbc.url The JDBC URL for the backend database the Sentry server should use
sentry.store.jdbc.user The JDBC username to connect as
sentry.store.group.mapping The class that provides the mapping of users to groups; typically
org.apache.sentry.provider.common.HadoopGroupMappingService
With older versions of Hive, this was pretty much all you had; the Hive client API would talk directly to the
metastore database and perform operations. From a security standpoint, this is bad. This model meant that
every Hive client had the full credentials to the Hive metastore database! The Hive Metastore Server
became a component of the Hive architecture to address this problem, among others. This role’s purpose is
to become a middle layer between Hive clients and the metastore database. With this model, clients need
only to know how to contact the Hive Metastore Server, whereas only the Hive Metastore Server holds the
keys to the underlying metastore database.
The last component of the Hive architecture is HiveServer2. This component’s purpose is to provide a
query service to external applications using interfaces such as JDBC and ODBC. HiveServer2 fields
requests from clients, communicates with the Hive Metastore Server to retrieve metadata information, and
performs Hive actions as appropriate, such as spawning off MapReduce jobs. As the name implies,
HiveServer2 is the second version of such a service, with the initial version lacking concurrency and
security features. The important part to understand here is that HiveServer2 was initially meant to serve
external applications. The Hive command-line interface (CLI) was still interacting directly with the Hive
Metastore Server and using Hive APIs to perform actions. Users could use the CLI for HiveServer2,
beeline, to perform actions, but it was not required. This fact poses a challenge for enforcing secure
authorization for all clients. As you might have guessed, the way to achieve this is to enforce secure
authorization for HiveServer2, and ensure that all SQL clients must use HiveServer2 to perform any and all
Hive SQL operations.
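For example, a client on a Kerberos-enabled cluster might connect through HiveServer2 with beeline
along these lines; the hostname, port, and principal are placeholders:
beeline -u "jdbc:hive2://server1.example.com:10000/default;principal=hive/server1.example.com@EXAMPLE.COM"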
Another component of the Hive architecture is HCatalog. This is a set of libraries that allows non-SQL
clients to access Hive Metastore structures. This is useful for users of Pig or MapReduce to determine the
metadata structures of files without having to use traditional Hive clients. An extension of the HCatalog
libraries is the WebHCatServer component. This component is a daemon process that provides a REST
interface to perform HCatalog functions. Neither the HCatalog libraries, nor the WebHCatServer utilize
HiveServer2. All communication is directly to the Hive Metastore Server. Because of this fact, the Hive
Metastore Server must also be protected by Sentry to ensure HCatalog users cannot make arbitrary
modifications to the Hive Metastore database.
WARNING
While the 1.4 release of Sentry has the ability to provide write protection of the Hive Metastore Server, it does not currently limit
reads. What this means is that a user doing something equivalent to a SHOW TABLES operation in HCatalog will return a list of all
tables, including tables they do not have access to. This is different from the same operation performed via HiveServer2 where
the user only sees the objects they have access to. However, this is only metadata exposure. Permissions of the actual data are still
enforced at the time of access by HDFS. If your cluster does not have any users that utilize HCatalog, a way to force all Hive
traffic to HiveServer2 is to set the property hadoop.proxyuser.hive.groups in the core-site.xml configuration file to
hive,impala, which allows both Hive (HiveServer2) and Impala (Catalog Server) to directly access the Hive Metastore Server,
but nobody else.
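In core-site.xml, that restriction looks like the following:
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>hive,impala</value>
</property>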
Figure 7-4 shows how the Hive architecture is laid out and where the Sentry enforcements occur. As you
can see, regardless of the method of access, the key enforcement is protecting the Hive metastore from
unauthorized changes.
Figure 7-4. Hive sentry architecture

Hive Sentry Configuration
In this section, we take a look at what is necessary to configure Hive to leverage Sentry for authorization.
Example 7-2 shows the sentry-site.xml configuration file that is used by both the Hive Metastore Server
and HiveServer2 to leverage a Sentry service.
Example 7-2. Hive sentry-site.xml service deployment
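The original listing is not reproduced here. Based on the properties discussed below, a sketch of the
service-based configuration (host, port, and principal are placeholders) might look like:
<configuration>
  <property>
    <name>hive.sentry.server</name>
    <value>server1</value>
  </property>
  <property>
    <name>sentry.hive.provider.backend</name>
    <value>org.apache.sentry.provider.db.SimpleDBProviderBackend</value>
  </property>
  <property>
    <name>hive.sentry.provider</name>
    <value>org.apache.sentry.provider.common.HadoopGroupResourceAuthorizationProvider</value>
  </property>
  <property>
    <name>sentry.service.client.server.rpc-address</name>
    <value>server1.example.com</value>
  </property>
  <property>
    <name>sentry.service.client.server.rpc-port</name>
    <value>8038</value>
  </property>
  <property>
    <name>sentry.service.security.mode</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>sentry.service.server.principal</name>
    <value>sentry/_HOST@EXAMPLE.COM</value>
  </property>
  <property>
    <name>sentry.metastore.service.users</name>
    <value>hive,impala</value>
  </property>
</configuration>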
Example 7-3 shows a typical configuration for Sentry when used with HiveServer2 (and Hive Metastore
Server) and a policy file–based deployment. The policy file–based configuration for Sentry is rather
minimal when compared to the service-based configuration, but there are commonalities. The location of
sentry-site.xml on the local filesystem is specified in the HiveServer2 daemon's hive-site.xml configuration file, as we will see later.
Example 7-3. Hive sentry-site.xml policy file deployment
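The original listing is not reproduced here. A sketch of the policy file–based configuration, using the
same placeholder server label and a hypothetical policy file path, might look like:
<configuration>
  <property>
    <name>hive.sentry.server</name>
    <value>server1</value>
  </property>
  <property>
    <name>sentry.hive.provider.backend</name>
    <value>org.apache.sentry.provider.file.SimpleFileProviderBackend</value>
  </property>
  <property>
    <name>hive.sentry.provider</name>
    <value>org.apache.sentry.provider.common.HadoopGroupResourceAuthorizationProvider</value>
  </property>
  <property>
    <name>hive.sentry.provider.resource</name>
    <value>hdfs://nameservice1/user/hive/sentry/sentry-provider.ini</value>
  </property>
</configuration>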
First, we will look at the configuration properties that are used in both examples. The parameter
hive.sentry.server specifies a name (label) for this particular Sentry server, which can be referenced in
policies. This name has nothing to do with a machine hostname. The sentry.hive.provider.backend
configuration tells Hive which provider backend to use. For the Sentry service, this is
org.apache.sentry.provider.db.SimpleDBProviderBackend, and for the Sentry policy file this is
org.apache.sentry.provider.file.SimpleFileProviderBackend.
hive.sentry.provider configures the method that Sentry will use to determine group information.
HadoopGroupResourceAuthorizationProvider, shown here, will leverage whatever method Hadoop is
configured with, such as reading groups from the local operating system or directly pulling group
information from LDAP. However, this is a mere formality in Example 7-2 because the Sentry service
cannot use a policy file to define local user-to-group mappings.
Next we will look at the configurations that are specific to the Sentry service example. The details of how
Hive should connect to the Sentry service are provided by sentry.service.client.server.rpc-address and
sentry.service.client.server.rpc-port. Both sentry.service.security.mode and
sentry.service.server.principal set up the Kerberos configuration details. Finally, the
sentry.metastore.service.users configuration lists the users that are allowed to bypass Sentry
authorization and connect directly to the Hive Metastore Server. This likely will always be
service/system users like Hive and Impala, as the example shows.
The remaining configuration that is specific to the policy file deployment example is
hive.sentry.provider.resource. This specifies the location of the policy file. A location given without a
scheme is assumed to be on the filesystem specified in hdfs-site.xml. For example, the path
/user/hive/sentry/sentry-provider.ini is assumed to be in HDFS if hdfs-site.xml points to HDFS. It is also
possible to be explicit in the location by providing hdfs:// for an HDFS path or file:// for a local
filesystem path.
While the sentry-site.xml configuration is important for Hive, on its own it does not enable Sentry
authorization for it. Additional configuration is necessary in Hive’s configuration file, hive-site.xml.
Example 7-4 shows the relevant configurations needed for the Hive Metastore Server, and Example 7-5
similarly shows what is needed for HiveServer2. The two configurations are similar, but slightly different.
The last hive-site.xml example shown in Example 7-6 shows what is needed for HiveServer2 in a policy
file–based deployment. Note that in a policy file–based deployment, no additional configuration is needed
for the Hive Metastore Server (more on that later).
Example 7-4. Hive Metastore Server hive-site.xml Sentry service configurations
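The original listing is not reproduced here. A sketch of what the Hive Metastore Server typically needs
follows; the listener class names are an assumption based on the Sentry 1.4 metastore bindings and should
be verified against the Sentry release in use.
<property>
  <name>hive.sentry.conf.url</name>
  <value>file:///etc/hive/conf/sentry-site.xml</value>
</property>
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.sentry.binding.metastore.MetastoreAuthzBinding</value>
</property>
<property>
  <name>hive.metastore.event.listeners</name>
  <value>org.apache.sentry.binding.metastore.SentryMetastorePostEventListener</value>
</property>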
Example 7-5. HiveServer2 hive-site.xml Sentry service configurations
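The original listing is not reproduced here. A sketch of the HiveServer2 configuration follows; the
impersonation property is written here as hive.server2.enable.doAs (some distributions expose it as
hive.server2.enable.impersonation), which is an assumption to verify for your release.
<property>
  <name>hive.sentry.conf.url</name>
  <value>file:///etc/hive/conf/sentry-site.xml</value>
</property>
<property>
  <name>hive.server2.session.hook</name>
  <value>org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>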
Example 7-6. HiveServer2 hive-site.xml Sentry policy file configurations
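The original listing is not reproduced here. In a policy file–based deployment the HiveServer2
configuration has the same shape as Example 7-5; the difference is that hive.sentry.conf.url points at
the policy file–based sentry-site.xml from Example 7-3.
<property>
  <name>hive.sentry.conf.url</name>
  <value>file:///etc/hive/conf/sentry-site.xml</value>
</property>
<property>
  <name>hive.server2.session.hook</name>
  <value>org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>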
In all three hive-site.xml examples, the hive.sentry.conf.url configuration property tells Hive where to
locate the Sentry configuration file. In both HiveServer2 examples, the hive.server2.session.hook
property is used to specify a binding that will actually hand off authorization decisions to Sentry.
Also in both HiveServer2 examples, notice that impersonation is disabled. Disabling Hive impersonation is
a critical piece of Sentry configuration. In order to truly have authorization that is enforced all the way from
query to data access, Sentry and Hive need to have control of both the query interface as well as file access.
To do this, HDFS permissions of the Hive warehouse need to be locked down, as shown in Example 7-7.
Example 7-7. Locking down the Hive warehouse
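The original listing is not reproduced here. A typical lockdown, assuming the default warehouse location
and the hive user and group, looks like:
hdfs dfs -chown -R hive:hive /user/hive/warehouse
hdfs dfs -chmod -R 0771 /user/hive/warehouse
Mode 0771 lets members of the hive group traverse the warehouse while denying all other users direct
file access.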
After locking down the Hive warehouse and disabling impersonation, Sentry controls authorization at the
query interface. HDFS permissions are locked down because only the Hive system user is able to access
the files. Not only is this better from a security perspective, but it also allows Sentry the ability to control
authorization down to the view level. Views can be used for column-level security (selecting only certain
columns) and for row-level security, such as providing a filtering clause. If impersonation is enabled
and queries are thus run as the end user, view-level permissions are not realistically enforced because the
user has file-level (e.g., table-level) access in HDFS and can bypass Sentry policies by accessing files
directly, such as with MapReduce.
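As a sketch of this pattern, with hypothetical database, table, column, and role names:
-- Expose only two columns, filtered to a single region
CREATE VIEW marketing.clickstream_east AS
  SELECT page, ts FROM marketing.clickstream WHERE region = 'east';
-- Grant access to the view, not the underlying table
GRANT SELECT ON TABLE marketing.clickstream_east TO ROLE east_analyst;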
Impala Authorization
The initial release of Sentry included support for both Hive and Impala. While both of these components
have some similarities, they have some fundamental differences in architecture that need to be addressed
before we can fully understand how Sentry fits into the equation. First, Impala is an entire processing
framework. This differs from Hive in that Hive does not have any processing power to do the actual work a
user is requesting. That work is handled by MapReduce by default (either standalone version 1, or version
2 on YARN).
Impala architecture consists of three components: Daemon, StateStore, and Catalog Service. The Impala
Daemon, or impalad, is the actual worker process, which runs on every node in the cluster that runs the
HDFS DataNode daemon. The Impala StateStore, or statestored, is responsible for keeping track of the
health of all of the impalad instances in the cluster. If an instance goes bad, statestored broadcasts this
information to all the rest of the impalad instances. While this might seem like a critical component of the
Impala architecture, it actually is not required. If the statestored process goes down or does not exist at all,
all of the work done by the impalad instances continues to operate. The only potential impact is if an
impalad instance goes into bad health, the remaining instances will be slow to discover this, which can lead
to a delay in total query execution time. The Impala Catalog Service, or catalogd, is responsible for
keeping track of metadata changes. If an Impala query executes on an impalad that somehow changes
metadata, the catalogd broadcasts the updated metadata to the other impalad instances. The catalogd is
responsible for communicating with the Hive Metastore server to retrieve all existing metadata information.
Now that the basics of Impala architecture have been reviewed, we can cover where Sentry actually comes
into play. As described earlier in our discussion of Hive, Sentry is a plug-in for Hive components
HiveServer2 and Hive Metastore Server, which most of the time are each single instances on a cluster. With
Impala, Sentry is not a centralized plug-in to augment a single main component, such as for the catalogd or
statestored processes. Sentry is actually enabled on every impalad. When a user connects to a given
impalad with Sentry enabled and issues a query, the impalad uses the Sentry policy (either from the Sentry
service or a policy file) to determine if the user is authorized to perform the requested action.
Impala Sentry Configuration
Like we did in the previous section with Hive, in this section we take a look at what is necessary to
configure Impala to leverage Sentry for authorization. Example 7-8 shows the sentry-site.xml configuration
file that is used by the Impala daemons to leverage a Sentry service. In a policy file–based deployment, a
sentry-site.xml file is not required.
Example 7-8. Impala sentry-site.xml service deployment
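The original listing is not reproduced here. Per the text, it is a subset of the Hive configuration; a sketch
with placeholder host, port, and principal values might look like:
<configuration>
  <property>
    <name>sentry.service.client.server.rpc-address</name>
    <value>server1.example.com</value>
  </property>
  <property>
    <name>sentry.service.client.server.rpc-port</name>
    <value>8038</value>
  </property>
  <property>
    <name>sentry.service.security.mode</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>sentry.service.server.principal</name>
    <value>sentry/_HOST@EXAMPLE.COM</value>
  </property>
</configuration>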
As you might have noticed, the sentry-site.xml configuration for Impala to use a Sentry service is a subset
of the configuration for Hive. The properties were already discussed in the last section, so we can move on
to configuring the Impala daemons to enable Sentry authorization.
In a Sentry service deployment, the Impala daemons need just three flags configured. The first flag is
--server_name, which is a label for the Sentry server. This matches the hive.sentry.server configuration
property. The second flag is --sentry_config, which points the Impala daemon to the location of the
sentry-site.xml configuration file. The third flag, --authorized_proxy_user_config, is used to specify users
that serve as impersonators for other users, such as the hue user. Example 7-9 shows what this looks like.
Example 7-9. Impala flags for Sentry service deployment
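The original listing is not reproduced here. A sketch of the flags, with placeholder values:
--server_name=server1
--sentry_config=/etc/impala/conf/sentry-site.xml
--authorized_proxy_user_config=hue=*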
In a Sentry policy file–based deployment, the Impala daemons do not need the --sentry_config flag.
Instead, the Impala daemons are configured with the --authorization_policy_file and
--authorization_policy_provider_class flags. These flags indicate the location of the Sentry policy file
and the authorization provider class, respectively. The latter was described already with the
hive.sentry.provider configuration property, which serves the same purpose. Example 7-10 shows how
this looks.
Example 7-10. Impala flags for Sentry policy file deployment
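The original listing is not reproduced here. A sketch of the flags follows; the provider class package name
varies across Sentry versions and should be treated as an assumption:
--server_name=server1
--authorization_policy_file=hdfs://nameservice1/user/hive/sentry/sentry-provider.ini
--authorization_policy_provider_class=org.apache.sentry.provider.common.HadoopGroupResourceAuthorizationProvider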
Solr Authorization
Authorization for Solr starts with collections. Collections are the main entry point of access, much like how
databases are for SQL. Sentry authorization initially started with defining privileges at the collection level.
Sentry has since evolved to provide document-level authorization. Document-level authorization is done by
tagging each document with a special field name containing the value that corresponds to an associated
Sentry role name defined in the Sentry policy file (described later). The tagging of documents in this fashion
would be done at ingest time, so it is important to have a good sense of role names to avoid needing to
reprocess documents to change tag values.
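For instance, a document tagged at ingest time might carry a field like the following; the document values
are hypothetical, and the field name matches the default sentry_auth described later in this chapter:
{
  "id": "doc1",
  "title": "Quarterly payroll summary",
  "sentry_auth": ["finance_role"]
}
Only users holding a role whose name matches a token in sentry_auth would see this document in query
results.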
Solr Sentry Configuration
This section explains how to set up Solr with Sentry authorization. Example 7-11 shows what is needed in
the sentry-site.xml configuration file for the Solr servers.
Example 7-11. Solr sentry-site.xml policy file deployment
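The original listing is not reproduced here. A sketch assembled from the Solr-specific parameters in
Table 7-1, with a hypothetical policy file path, might look like:
<configuration>
  <property>
    <name>sentry.provider</name>
    <value>org.apache.sentry.provider.common.HadoopGroupResourceAuthorizationProvider</value>
  </property>
  <property>
    <name>sentry.solr.provider.resource</name>
    <value>hdfs://nameservice1/user/solr/sentry/sentry-provider.ini</value>
  </property>
</configuration>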
The two configuration properties shown in Example 7-11 should look very familiar at this point, but the
configuration property names are slightly different with Solr. The sentry.provider configuration property
works just like the hive.sentry.provider configuration for Hive and the
--authorization_policy_provider_class flag for Impala. The sentry.solr.provider.resource configuration
property specifies the location of the Sentry policy file. Again, this policy file can be located either on the
local filesystem or on HDFS. It needs to be readable by the user that the Solr servers are running as
(typically the solr user).
To set up the Solr servers with Sentry authorization, some environment variables are needed. These can
either be set as environment variables or as lines in the /etc/default/solr configuration file. The first
variable enables Sentry authorization when set to true. The next defines the user that has superuser
privileges, which typically should be the solr user. The last specifies the location of the sentry-site.xml
configuration file described earlier. Example 7-12 shows how this looks.
Example 7-12. Solr environment variables in Sentry policy file deployment
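The original listing is not reproduced here. In CDH 5-era packaging the variables are commonly written
along these lines; the exact variable names are assumptions and should be verified against your Solr
deployment:
SOLR_SENTRY_ENABLED=true
SOLR_AUTHORIZATION_SUPERUSER=solr
SOLR_AUTHORIZATION_SENTRY_SITE=/etc/solr/conf/sentry-site.xml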
It was mentioned earlier in this section that document-level authorization can be used. In order to make that
happen, a few configurations are necessary for the collection. By default, collections are configured using
the solrconfig.xml configuration file. This file needs to look like Example 7-13.
Example 7-13. Document-level security solrconfig.xml
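The original listing is not reproduced here. A sketch assembled from the properties described next (the
component name is an assumption) might look like:
<searchComponent name="queryDocAuthorization"
                 class="org.apache.solr.handler.component.QueryDocAuthorizationComponent">
  <bool name="enabled">true</bool>
  <str name="sentryAuthField">sentry_auth</str>
  <str name="allRolesToken">*</str>
</searchComponent>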
Example 7-13 shows that the class
org.apache.solr.handler.component.QueryDocAuthorizationComponent is used for document-level
authorization decisions. It is turned on by setting the enabled property to true. The sentryAuthField
configuration property defines the name of the field in a document that contains the authorization token
used to determine access. The default value of sentry_auth is shown, but this can be anything. Documents
use this field to carry the role name that is required to access the document. The last configuration
property, allRolesToken, defines the token that allows every role to access a given document. The default
is *, and it makes sense to leave that as is to remain consistent with wildcard matches in other Sentry
privileges.
Sentry provides three types of privileges for SQL access: SELECT, INSERT, and ALL. These privileges
are not available for every object. Table 7-2 provides information on which privileges apply to which
object in a SQL context. The SQL privilege model itself is a hierarchy, meaning privileges to container
objects imply privileges to child objects. This is important to fully understand what users do or do not have
access to.
Table 7-2. SQL privilege typesa
Privilege | Object
INSERT | TABLE, URI
SELECT | TABLE, VIEW, URI
ALL | SERVER, DB, URI
a All privilege model tables are reproduced from cloudera.com with permission from Cloudera, Inc.
Table 7-3 lays out which container privilege yields the granular privilege on a given object. For example,
the first line in the table should be interpreted as “ALL privileges on a SERVER object implies ALL
privileges on a DATABASE object.”
Table 7-3. SQL privilege hierarchy
Base Object | Granular Privilege | Container Object | Container Privilege That Implies Granular Privilege |
DATABASE | ALL | SERVER | ALL |
TABLE | INSERT | DATABASE | ALL |
TABLE | SELECT | DATABASE | ALL |
VIEW | SELECT | DATABASE | ALL |
The final portion of the SQL privilege model is to understand how privileges map to SQL operations.
Table 7-4 shows for a given SQL operation, what object scope does the operation apply to, and what
privileges are required to perform the operation. For example, the first line in the table should be
interpreted as “CREATE DATABASE applies to the SERVER object and requires ALL privileges on the
SERVER object.” Some of the SQL operations involve more than one privilege, such as creating views.
Creating a new view requires ALL privileges on the DATABASE in which the view is to be created, as
well as SELECT privileges on the TABLE/VIEW object(s) referenced by the view.
Table 7-4. SQL privileges
SQL Operation | Scope | Privileges |
CREATE DATABASE | SERVER | ALL |
DROP DATABASE | DATABASE | ALL |
CREATE TABLE | DATABASE | ALL |
DROP TABLE | TABLE | ALL |
CREATE VIEW | DATABASE; SELECT on TABLE | ALL |
DROP VIEW | VIEW/TABLE | ALL |
CREATE INDEX | TABLE | ALL |
DROP INDEX | TABLE | ALL |
ALTER TABLE ADD COLUMNS | TABLE | ALL |
ALTER TABLE REPLACE COLUMNS | TABLE | ALL |
ALTER TABLE CHANGE column | TABLE | ALL |
ALTER TABLE RENAME | TABLE | ALL |
ALTER TABLE SET TBLPROPERTIES | TABLE | ALL |
ALTER TABLE SET FILEFORMAT | TABLE | ALL |
ALTER TABLE SET LOCATION | TABLE | ALL |
ALTER TABLE ADD PARTITION | TABLE | ALL |
ALTER TABLE ADD PARTITION location | TABLE | ALL |
ALTER TABLE DROP PARTITION | TABLE | ALL |
ALTER TABLE PARTITION SET FILEFORMAT | TABLE | ALL |
SHOW TBLPROPERTIES | TABLE | SELECT/INSERT |
SHOW CREATE TABLE | TABLE | SELECT/INSERT |
SHOW PARTITIONS | TABLE | SELECT/INSERT |
DESCRIBE TABLE | TABLE | SELECT/INSERT |
DESCRIBE TABLE PARTITION | TABLE | SELECT/INSERT |
LOAD DATA | TABLE; URI | INSERT |
SELECT | TABLE | SELECT |
INSERT OVERWRITE TABLE | TABLE | INSERT |
CREATE TABLE AS SELECT | DATABASE; SELECT on TABLE | ALL |
USE database | ANY | ANY |
ALTER TABLE SET SERDEPROPERTIES | TABLE | ALL |
ALTER TABLE PARTITION SET SERDEPROPERTIES | TABLE | ALL |
CREATE ROLE | SERVER | ALL |
GRANT ROLE TO GROUP | SERVER | ALL |
GRANT PRIVILEGE ON SERVER | SERVER | ALL |
GRANT PRIVILEGE ON DATABASE | DATABASE | WITH GRANT OPTION |
GRANT PRIVILEGE ON TABLE | TABLE | WITH GRANT OPTION |
While most of the SQL operations are supported by both Hive and Impala, some operations are supported
only by Hive or Impala, or have not been implemented yet. Table 7-5 lists the SQL privileges that only
apply to Hive, and Table 7-6 lists the SQL privileges that only apply to Impala.
Table 7-5. Hive-only SQL privileges
SQL Operation | Scope | Privileges |
INSERT OVERWRITE DIRECTORY | TABLE; URI | INSERT |
ANALYZE TABLE | TABLE | SELECT + INSERT |
IMPORT TABLE | DATABASE; URI | ALL |
EXPORT TABLE | TABLE; URI | SELECT |
ALTER TABLE TOUCH | TABLE | ALL |
ALTER TABLE TOUCH PARTITION | TABLE | ALL |
ALTER TABLE CLUSTERED BY SORTED BY | TABLE | ALL |
ALTER TABLE ENABLE/DISABLE | TABLE | ALL |
ALTER TABLE PARTITION ENABLE/DISABLE | TABLE | ALL |
ALTER TABLE PARTITION RENAME TO PARTITION | TABLE | ALL |
ALTER DATABASE | DATABASE | ALL |
DESCRIBE DATABASE | DATABASE | SELECT/INSERT |
SHOW COLUMNS | TABLE | SELECT/INSERT |
SHOW INDEXES | TABLE | SELECT/INSERT |
Table 7-6. Impala-only SQL privileges
SQL Operation | Scope | Privileges
EXPLAIN | TABLE | SELECT
INVALIDATE METADATA | SERVER | ALL
INVALIDATE METADATA table | TABLE | SELECT/INSERT
REFRESH table | TABLE | SELECT/INSERT
CREATE FUNCTION | SERVER | ALL
DROP FUNCTION | SERVER | ALL
COMPUTE STATS | TABLE | ALL
With Solr, Sentry provides three types of privileges: QUERY, UPDATE, and * (ALL). The privilege model for
Solr is broken down between privileges that apply to request handlers and those that apply to collections. In
Tables 7-8 through 7-10, the admin collection name is a special collection in Sentry that is used to
represent administrative actions. In all of the Solr privilege model tables, collection1 denotes an arbitrary
collection name.
Table 7-7. Solr privilege table for nonadministrative request handlers
Request handler | Required privilege | Collections that require privilege |
select | QUERY | collection1 |
query | QUERY | collection1 |
get | QUERY | collection1 |
browse | QUERY | collection1 |
tvrh | QUERY | collection1 |
clustering | QUERY | collection1 |
terms | QUERY | collection1 |
elevate | QUERY | collection1 |
analysis/field | QUERY | collection1 |
analysis/document | QUERY | collection1 |
update | UPDATE | collection1 |
update/json | UPDATE | collection1 |
update/csv | UPDATE | collection1 |
Table 7-8. Solr privilege table for collections admin actions
Collection action | Required privilege | Collections that require privilege |
create | UPDATE | admin, collection1 |
delete | UPDATE | admin, collection1 |
reload | UPDATE | admin, collection1 |
createAlias | UPDATE | admin, collection1 |
deleteAlias | UPDATE | admin, collection1 |
syncShard | UPDATE | admin, collection1 |
splitShard | UPDATE | admin, collection1 |
deleteShard | UPDATE | admin, collection1
Table 7-9. Solr privilege table for core admin actions
Collection action | Required privilege | Collections that require privilege |
create | UPDATE | admin, collection1 |
rename | UPDATE | admin, collection1 |
load | UPDATE | admin, collection1 |
unload | UPDATE | admin, collection1 |
status | UPDATE | admin, collection1 |
persist | UPDATE | admin |
reload | UPDATE | admin, collection1 |
swap | UPDATE | admin, collection1 |
mergeIndexes | UPDATE | admin, collection1 |
split | UPDATE | admin, collection1 |
prepRecover | UPDATE | admin, collection1 |
requestRecover | UPDATE | admin, collection1 |
requestSyncShard | UPDATE | admin, collection1 |
requestApplyUpdates | UPDATE | admin, collection1 |
Table 7-10. Solr privilege table for administrative request handlers
Request handler | Required privilege | Collections that require privilege
LukeRequestHandler | QUERY | admin |
SystemInfoHandler | QUERY | admin |
SolrInfoMBeanHandler | QUERY | admin |
PluginInfoHandler | QUERY | admin |
ThreadDumpHandler | QUERY | admin |
PropertiesRequestHandler | QUERY | admin |
LoggingHandler | QUERY, UPDATE (or *) | admin
ShowFileRequestHandler | QUERY | admin |
Administering Sentry policies differs depending on the type of Sentry deployment, be it the newer Sentry
service or the older policy file. The first, and preferred, method of administering policy is by using SQL
commands.
Security administrators who are accustomed to managing roles and permissions in popular relational
database systems will find the SQL syntax for administering Sentry policies to be very familiar. Table 7-11
shows all of the statements available to an administrator managing Sentry policies.
Table 7-11. Sentry policy SQL syntax
Statement Description
CREATE ROLE role_name Creates a role with the specified name
DROP ROLE role_name Deletes a role with the specified name
GRANT ROLE role_name TO GROUP group_name Grants the specified role to the specified group
REVOKE ROLE role_name FROM GROUP group_name Revokes the specified role from the specified group
GRANT privilege ON object TO ROLE role_name Grants a privilege on an object to the specified role
GRANT privilege ON object TO ROLE role_name WITH GRANT OPTION Grants a privilege on an object to the specified role and allows the role to further grant privileges within the object
REVOKE privilege ON object FROM ROLE role_name Revokes a privilege on an object from the specified role
SET ROLE role_name Sets the specified role for the current session
SET ROLE ALL Enables all roles (that the user has access to) for the current session
SET ROLE NONE Disables all roles for the current session
SHOW ROLES Lists all roles in the database
SHOW CURRENT ROLES Shows all the roles enabled for the current session
SHOW ROLE GRANT GROUP group_name Shows all roles for the specified group
SHOW GRANT ROLE role_name Shows all grant permissions for the specified role
SHOW GRANT ROLE role_name ON object object_name Shows all grant permissions for the specified role on the specified object.
While Table 7-11 provides a good listing of the various syntaxes, a working example is warranted to see
these in action (Example 7-14).
Example 7-14. Sentry SQL usage example
# Authenticated as the hive user, which is a member of a group listed in
# sentry.service.admin.group and accessing HiveServer2
# via the beeline CLI
# Create the role for hive administrators
0: jdbc:hive2://server1.example.com:100> CREATE ROLE hive_admin;
No rows affected (0.852 seconds)
# Grant the hive administrator role to the sqladmin group
0: jdbc:hive2://server1.example.com:100> GRANT ROLE hive_admin TO GROUP sqladmin;
No rows affected (0.305 seconds)
# Grant server-wide permissions to the hive_admin role
0: jdbc:hive2://server1.example.com:100> GRANT ALL ON SERVER server1
TO ROLE hive_admin;
No rows affected (0.339 seconds)
# Show all of the roles in the Sentry database
0: jdbc:hive2://server1.example.com:100> SHOW ROLES;
+-------------+
| role |
+-------------+
| hive_admin |
+-------------+
1 row selected (0.63 seconds)
# Show all the privileges that the hive_admin role has access to
# (some columns omitted for brevity)
0: jdbc:hive2://server1.example.com:100> SHOW GRANT ROLE hive_admin;
+-----------+-----------------+-----------------+------------+---------------+
| database | principal_name | principal_type | privilege | grant_option |
+-----------+-----------------+-----------------+------------+---------------+
| * | hive_admin | ROLE | * | false |
+-----------+-----------------+-----------------+------------+---------------+
+----------+
| grantor |
+----------+
| hive |
+----------+
1 row selected (0.5 seconds)
# Show all the roles that the sqladmin is a part of
0: jdbc:hive2://server1.example.com:100> SHOW ROLE GRANT GROUP sqladmin;
+-------------+---------------+-------------+----------+
| role | grant_option | grant_time | grantor |
+-------------+---------------+-------------+----------+
| hive_admin | false | | hive |
+-------------+---------------+-------------+----------+
1 row selected (0.5 seconds)
# Remove all of the roles for the current user session
0: jdbc:hive2://server1.example.com:100> SET ROLE NONE;
No rows affected (0.2 seconds)
# Show list of current roles
0: jdbc:hive2://server1.example.com> SHOW CURRENT ROLES;
+-------+
| role |
+-------+
+-------+
No rows selected (0.305 seconds)
# Verify that no roles yields no access
0: jdbc:hive2://server1.example.com:100> SHOW TABLES;
+-----------+
| tab_name |
+-----------+
+-----------+
0: jdbc:hive2://server1.example.com:100> SELECT COUNT(*) FROM sample_07;
Error: Error while compiling statement:
FAILED: SemanticException No valid privileges (state=42000,code=40000)
# Set the current role to the hive_admin role
0: jdbc:hive2://server1.example.com:100> SET ROLE hive_admin;
No rows affected (0.176 seconds)
# Show list of current roles
0: jdbc:hive2://server1.example.com:100> SHOW CURRENT ROLES;
+-------------+
| role |
+-------------+
| hive_admin |
+-------------+
1 row selected (0.404 seconds)
# Execute commands that are permitted
0: jdbc:hive2://server1.example.com:100> SHOW TABLES;
+------------+
| tab_name |
+------------+
| sample_07 |
| sample_08 |
+------------+
2 rows selected (0.536 seconds)
0: jdbc:hive2://server1.example.com:100> SELECT COUNT(*) FROM sample_07;
+------+
| _c0 |
+------+
| 823 |
+------+
1 row selected (20.811 seconds)
TIP
Using WITH GRANT OPTION is a great way to ease the administration burden on a global SQL administrator. A common example
is to create a Hive database for a given line of business and delegate administrative privileges to a database-specific admin role.
This gives the line of business the flexibility to manage privileges to their own data. To determine if a role has this option, use SHOW
GRANT ROLE role_name and look at the column grant_option.
In Example 7-14, the commands are executed using the beeline CLI for HiveServer2, but they can also be
run from within the impala-shell. Both components utilize the same Sentry service and thus the same
Sentry policies, so changes made from one component are immediately reflected in the other. Sentry
authorization decisions are not cached by the individual components because of the security ramifications of
doing so.
For Sentry deployments that utilize the Sentry service for SQL components, policy administration is
familiar and straightforward. This is not the case with the legacy policy file–based implementation. Sentry-
enabled components need to have read access to the policy file. When using a policy file for Hive and
Impala, this can be achieved by making the file group owned by the hive group and ensuring that both hive
and impala users are members of this group. The policy file itself can be located either on the local system
or in HDFS. For the former, the file needs to exist wherever the component that is making the authorization
decision is deployed. For example, when Sentry is enabled for Hive, the local policy file needs to be on the
machine where the HiveServer2 daemon is running.
TIP
It is highly recommended to specify a location in HDFS for the policy file in order to leverage HDFS replication for redundancy and
availability. Because the policy file is read for every single user operation, it makes sense to increase the replication factor of the file
so components reading it can retrieve it from many different nodes. This can be done with hdfs dfs -setrep N
/user/hive/sentry/sentry-provider.ini, where N is the number of replicas desired. The policy file is small, so it is
perfectly reasonable to set the number of replicas to the number of DataNodes in the cluster.
The format of the policy file follows a typical INI file format, with configuration sections identified with
square brackets and individual configurations specified as KEY = VALUE pairs. Example 7-15 shows a sample
policy file for Sentry when used with Hive and Impala.
Example 7-15. SQL sentry-provider.ini
[databases]
product = hdfs://nameservice1/user/hive/sentry/product.ini
[groups]
admins = admin_role, tmp_access
analysts = analyst_role, tmp_access
developers = developer_role, tmp_access
etl = etl_role, tmp_access
[roles]
# uri accesses
tmp_access = server=server1->uri=hdfs://nameservice1/tmp
# default database accesses
analyst_role = server=server1->db=default->table=*->action=select
developer_role = server=server1->db=default
etl_role = server=server1->db=default->table=*->action=insert, \
server=server1->db=default->table=*->action=select
# administrative role
admin_role = server=server1
The policy file in Example 7-15 has a lot going on and it might not be immediately apparent what it is
defining. The first section of the policy file is databases. This section lists all of the databases and the
corresponding policy files to be used to secure access to them. Having separate configuration files for each
database is certainly not required. However, separating out the configuration files provides the following
benefits:
Allows for version tracking to easily tell which database was affected by a policy change and when
Allows for delegated administrative control at a per-database level of granularity
A misconfiguration of a given database policy file does not affect the master sentry-provider.ini or other
database policy files
Easily disable access for an entire database simply by changing permissions of the policy file in HDFS,
which does not require a change to the policy file itself
The second section of the policy file is groups. This section provides a mapping between groups and the
roles to which they are assigned. The syntax for this section is group = role. The group, as discussed
earlier, comes from one of two places: the group names according to Hadoop, or locally configured groups
specifically for Sentry. In Example 7-15, no locally configured groups are defined because the earlier
sentry-site.xml in Example 7-1 is configured with HadoopGroupResourceAuthorizationProvider. A
few important facts about the groups section of the policy file:
A given group can be assigned to many roles, separated by commas
Entries in the groups section are read top-down, thus making duplicate entries overwrite any previous
entries for the same group
Names of groups are global in scope, regardless of whether they are defined locally or provided by
Hadoop
Names of roles are local in scope in that the name of a role assigned to a group only applies to the file in
which it is configured
The last section of the policy file is roles. This is where the meat of the policy is defined. The role
configuration syntax is role = permission. The permission portion of the configuration looks a little odd
in that it also has key=value syntax, but with arrows between each set of key/value configurations to
indicate a more granular permission being defined. In general, the shorter the permission string, the greater
the permissions. This is evidenced by the admin_role permission definition. This role is granted complete
access to do anything at the server level. The next granular level of access is the database, or db level. The
developer_role permission definition grants complete access to do anything with the default database.
After that, the next level of access is at the table level. The example shows another feature of the policy file
in that it supports a wildcard option to represent “any” table.
Wildcards are only valid to represent everything. They cannot be used in a leading or trailing fashion to
reference any table with a partial name match. This might seem like a limiting or inconvenient
implementation of wildcards, but keep in mind that these are security policies. Partial name matching with
wildcards opens the door to accidental granting of privileges to unauthorized users. Imagine a scenario
where access to any table starting with “pa” was being granted to a group of developers that are located in
Pennsylvania, but later, users from human resources start using the cluster and create a table called
“payroll” containing information about the pay stubs for all employees in the company. Now the group of
developers in Pennsylvania have unintended access to confidential information. Be very careful with
wildcards and security.
Still within the roles section, the finest level of granularity for permissions is the action
portion. In the context of table objects in Hive and Impala, the only supported actions are select and
insert. These actions are mutually exclusive, so if a role is intended to be granted both select and
insert on a table, both permissions are necessary. As with the groups section, multiple permissions can be
given to a role. To do so, simply separate them by a comma. A backslash character can be used to carry
over a list of permission definitions for ease of readability, as shown for the etl_role in Example 7-15.
The last part of the roles section to discuss is the notion of URIs. Example 7-15 shows a URI permission for
the tmp_access role. This permission allows users to do two things: create external tables to data in this
location, and export data from other tables they have access to into this location.
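For instance, a member of a group holding tmp_access could expose raw files in that location as an
external table (the table and path names here are hypothetical):
CREATE EXTERNAL TABLE default.raw_events (line STRING)
LOCATION 'hdfs://nameservice1/tmp/raw_events';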
WARNING
URI accesses by default can only be specified in sentry-provider.ini and not in per-database policy files. The reason for this
restriction is the case where a separate administrator maintains a database policy file but does not administer any others. If this
administrator were able to define URI access in the policy file they control, they could grant themselves or anyone else access to
any location in HDFS that is readable by the hive user by using external tables. If this behavior needs to be overridden, the Java
configuration option sentry.allow.uri.db.policyfile=true needs to be set for HiveServer2. This configuration should only
be used if all administrators have equal access to change all Sentry policy files.
While Hive and Impala can now leverage a Sentry service and administer policies using SQL syntax, Solr
has not yet migrated away from using policy files. The policy file format is similar to the SQL counterpart,
with a few changes. Solr authorization operates on collections instead of databases and tables like SQL
components do. Also, Solr privileges do not have SELECT and INSERT, but instead use Query and Update.
Solr privileges can also be All, denoted by an asterisk (*).
Example 7-16 shows a similar layout to the SQL example. In the groups section, groups are assigned roles;
and in the roles section, roles are assigned privileges. The analyst_role provides access to query the
customer_logs collection, the etl_role provides access to update it, and finally the developer_role
has full access to it. Lastly, the admin_role has full privileges to the admin collection.
Example 7-16. Solr sentry-provider.ini
[groups]
admins = admin_role
analysts = analyst_role
developers = developer_role
etl = etl_role
[roles]
analyst_role = collection=customer_logs->action=Query
developer_role = collection=customer_logs->action=*
etl_role = collection=customer_logs->action=Update
# administrative role
admin_role = collection=admin->action=*
It is important to point out that while SQL policy files allow for separate policy files per database, Solr
does not. This means that Solr policy administrators need to be extra careful when modifying the policies
because, as with the SQL policy files, a syntax error invalidates the entire policy file, thus inadvertently
denying access to everyone. A nice feature to help combat typos and mistakes is to validate the policy file
using the config-tool, which leads us into the next section.
Policy File Verification and Validation
When Sentry was first architected to use plain-text policy files, it was immediately apparent that
administrators would need some kind of validation tool to perform basic sanity checks on the file prior to
putting it in place. Sentry ships with a binary file, named sentry (surprise, surprise), which provides an
important feature for policy file implementations: the config-tool command. This command allows an
administrator to check the policy file for errors, but it also provides a mechanism to verify privileges for a
given user. Example 7-17 demonstrates validating a policy file, where the first policy file has no errors and
the second policy file has a typo (the word “sever” instead of “server”).
Example 7-17. Sentry config-tool validation
[root@server1 ~]# sentry --hive-config /etc/hive/conf --command config-tool
-s file:///etc/sentry/sentry-site.xml -i file:///etc/sentry/sentry-provider.ini -v
Using hive-conf-dir /etc/hive/conf
Configuration:
Sentry package jar: file:/var/lib/sentry/sentry-binding-hive-1.4.0.jar
Hive config: file:/etc/hive/conf/hive-site.xml
Sentry config: file:/etc/sentry/sentry-site.xml
Sentry Policy: file:///etc/sentry/sentry-provider.ini
Sentry server: server1
No errors found in the policy file
[root@server1 ~]# sentry --hive-config /etc/hive/conf --command config-tool
-s file:///etc/sentry/sentry-site.xml -i file:///etc/sentry/sentry-provider2.ini -v
Using hive-conf-dir /etc/hive/conf
Configuration:
Sentry package jar: file:/var/lib/sentry/sentry-binding-hive-1.4.0.jar
Hive config: file:/etc/hive/conf/hive-site.xml
Sentry config: file:/etc/sentry/sentry-site.xml
Sentry Policy: file:///etc/sentry/sentry-provider2.ini
Sentry server: server1
*** Found configuration problems ***
ERROR: Error processing file file:/etc/sentry/sentry-provider2.ini
No authorizable found for sever=server1
ERROR: Failed to process global policy file
file:/etc/sentry/sentry-provider2.ini
Sentry tool reported Errors:
org.apache.sentry.core.common.SentryConfigurationException:
at org.apache.sentry.provider.file.SimpleFileProviderBackend.
validatePolicy(SimpleFileProviderBackend.java:198)
at org.apache.sentry.policy.db.SimpleDBPolicyEngine.
validatePolicy(SimpleDBPolicyEngine.java:87)
at org.apache.sentry.provider.common.ResourceAuthorizationProvider.
validateResource(ResourceAuthorizationProvider.java:170)
at org.apache.sentry.binding.hive.authz.SentryConfigTool.
validatePolicy(SentryConfigTool.java:247)
at org.apache.sentry.binding.hive.authz.
SentryConfigTool$CommandImpl.run(SentryConfigTool.java:638)
at org.apache.sentry.SentryMain.main(SentryMain.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.
invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.
invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
[root@server1 ~]#
Verifying a user’s privileges is another powerful feature offered by the config-tool. This can be done
both by listing all privileges for a given user, or can be more specific by testing whether a given user would
be authorized to execute a certain query. Example 7-18 demonstrates the usage of these features.
Example 7-18. Sentry config-tool verification
[root@server1 ~]# sentry --hive-config /etc/hive/conf --command config-tool \
-s file:///etc/sentry/sentry-site.xml \
-i file:///etc/sentry/sentry-provider.ini -l -u bob
Using hive-conf-dir /etc/hive/conf
Configuration:
Sentry package jar: file:/var/lib/sentry/sentry-binding-hive-1.4.0.jar
Hive config: file:/etc/hive/conf/hive-site.xml
Sentry config: file:/etc/sentry/sentry-site.xml
Sentry Policy: file:///etc/sentry/sentry-provider.ini
Sentry server: server1
Available privileges for user bob:
server=server1
server=server1->uri=hdfs://server1.example.com:8020/tmp
[root@server1 ~]# sentry --hive-config /etc/hive/conf --command config-tool \
-s file:///etc/sentry/sentry-site.xml \
-i file:///etc/sentry/sentry-provider.ini -l -u alice
Using hive-conf-dir /etc/hive/conf
Configuration:
Sentry package jar: file:/var/lib/sentry/sentry-binding-hive-1.4.0.jar
Hive config: file:/etc/hive/conf/hive-site.xml
Sentry config: file:/etc/sentry/sentry-site.xml
Sentry Policy: file:///etc/sentry/sentry-provider.ini
Sentry server: server1
Available privileges for user alice:
*** No permissions available ***
[root@server1 ~]# sentry --hive-config /etc/hive/conf --command config-tool
-s file:///etc/sentry/sentry-site.xml -i file:///etc/sentry/sentry-provider.ini
-u bob -e "select * from sample_08"
Using hive-conf-dir /etc/hive/conf
Configuration:
Sentry package jar: file:/var/lib/sentry/sentry-binding-hive-1.4.0.jar
Hive config: file:/etc/hive/conf/hive-site.xml
Sentry config: file:/etc/sentry/sentry-site.xml
Sentry Policy: file:///etc/sentry/sentry-provider.ini
Sentry server: server1
User bob has privileges to run the query
[root@server1 ~]# sentry --hive-config /etc/hive/conf --command config-tool
-s file:///etc/sentry/sentry-site.xml -i file:///etc/sentry/sentry-provider.ini
-u alice -e "select * from sample_08"
Using hive-conf-dir /etc/hive/conf
Configuration:
Sentry package jar: file:/var/lib/sentry/sentry-binding-hive-1.4.0.jar
Hive config: file:/etc/hive/conf/hive-site.xml
Sentry config: file:/etc/sentry/sentry-site.xml
Sentry Policy: file:///etc/sentry/sentry-provider.ini
Sentry server: server1
FAILED: SemanticException No valid privileges
*** Missing privileges for user alice:
server=server1->db=default->table=sample_08->action=select
User alice does NOT have privileges to run the query
Sentry tool reported Errors: Compilation error: FAILED:
SemanticException No valid privileges
[root@server1 ~]#
When the Sentry service was added to the project, a useful migration tool was also included. This tool
allows an administrator to import the policies from the existing file into the Sentry service backend
database. This alleviated the pains of needing to derive the SQL syntax for every policy and manually
adding them to the database. The migration tool is a feature enhancement to the config-tool covered in the
last section. Example 7-19 demonstrates the usage.
Example 7-19. Sentry policy import tool
[root@server1 ~]# sentry --command config-tool --import \
-i file:///etc/sentry/sentry-provider.ini
Using hive-conf-dir /etc/hive/conf/
Configuration:
Sentry package jar: file:/var/lib/sentry/sentry-binding-hive-1.4.0.jar
Hive config: file:/etc/hive/conf/hive-site.xml
Sentry config: file:///etc/sentry/sentry-site.xml
Sentry Policy: file:///etc/sentry/sentry-provider.ini
Sentry server: server1
CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;
# server=server1
GRANT SELECT ON DATABASE default TO ROLE analyst_role;
CREATE ROLE admin_role;
CREATE ROLE developer_role;
CREATE ROLE etl_role;
GRANT ROLE admin_role TO GROUP admins;
GRANT ALL ON SERVER server1 TO ROLE admin_role;
GRANT ROLE developer_role TO GROUP developers;
# server=server1
GRANT ALL ON DATABASE default TO ROLE developer_role;
GRANT ROLE etl_role TO GROUP etl;
# server=server1
GRANT INSERT ON DATABASE default TO ROLE etl_role;
# server=server1
GRANT SELECT ON DATABASE default TO ROLE etl_role;
[root@server1 ~]#
Now that we have wrapped up the extensive topics of authentication and authorization, it is time to look at
accounting to make sense of user activity in the cluster. Auditing serves several needs:
Active auditing
Passive auditing
Security compliance
These needs arise because clusters routinely store sensitive data, such as
bank account numbers, and sensitive information about the business, like
payroll records and business financials.
Hadoop components handle accounting differently depending on the purpose
of the component. Components such as HDFS and HBase are data storage
systems, so auditable events focus on reading, writing, and accessing data.
Conversely, components such as MapReduce, Hive, and Impala are query
engines and processing frameworks, so auditable events focus on end-user
queries and jobs. The following subsections dig deeper into each component,
and describe typical interactions with the component from an accounting
point of view.
HDFS provides two different audit logs that are used for two different
purposes. The first, hdfs-audit.log, is used to audit general user activity,
such as when a user creates a new file, changes the permissions of a file, or
requests a directory listing. The second, SecurityAuth-hdfs.audit, is used to
audit service-level authorization activity. The setup for these logfiles
involves hooking into log4j.category.SecurityLogger and
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit.
Example 8-1 shows how to do it.
Example 8-1. HDFS log4j.properties
# other logging settings omitted
hdfs.audit.logger=${log.threshold},RFAAUDIT
hdfs.audit.log.maxfilesize=256MB
hdfs.audit.log.maxbackupindex=20
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.RFAAUDIT.MaxFileSize=${hdfs.audit.log.maxfilesize}
log4j.appender.RFAAUDIT.MaxBackupIndex=${hdfs.audit.log.maxbackupindex}
hadoop.security.logger=INFO,RFAS
hadoop.security.log.maxfilesize=256MB
hadoop.security.log.maxbackupindex=20
log4j.category.SecurityLogger=${hadoop.security.logger}
log4j.additivity.SecurityLogger=false
hadoop.security.log.file=SecurityAuth-${user.name}.audit
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${log.dir}/${hadoop.security.log.file}
log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.RFAS.MaxFileSize=${hadoop.security.log.maxfilesize}
log4j.appender.RFAS.MaxBackupIndex=${hadoop.security.log.maxbackupindex}
So what actually shows up when an auditable event occurs? For this set of
examples, let’s assume the following:
The user Alice is identified by the Kerberos principal
alice@EXAMPLE.COM, and she has successfully used kinit to receive a
valid TGT
She does a directory listing on her HDFS home directory
She creates an empty file named test in her HDFS home directory
She changes the permissions of this file to be world-writable
She attempts to move the file out of her home directory and into the /user
directory
In Example 8-2, Alice has performed several typical HDFS operations. These are user
activity events, so let's inspect hdfs-audit.log to see the trail that Alice left behind
(the example logfile has been formatted for readability).
Example 8-2. hdfs-audit.log
...
2014-03-11 23:50:18,251 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=getfileinfo src=/user/alice dst=null
perm=null
2014-03-11 23:50:18,280 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=listStatus src=/user/alice dst=null
perm=null
2014-03-11 23:50:32,058 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=getfileinfo src=/user/alice/test dst=null
perm=null
2014-03-11 23:50:32,073 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=getfileinfo src=/user/alice dst=null
perm=null
2014-03-11 23:50:32,096 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=create src=/user/alice/test dst=null
perm=alice:alice:rw-r-----
2014-03-11 23:50:39,558 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=getfileinfo src=/user/alice/test dst=null
perm=null
2014-03-11 23:50:39,587 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=setPermission src=/user/alice/test
dst=null
perm=alice:alice:rw-rw-rw-
2014-03-11 23:50:47,157 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=getfileinfo src=/user/alice/test dst=null
perm=null
2014-03-11 23:50:47,187 INFO FSNamesystem.audit: allowed=true
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=getfileinfo src=/user/test dst=null
perm=null
2014-03-11 23:50:47,190 INFO FSNamesystem.audit: allowed=false
ugi=alice@EXAMPLE.COM
(auth:KERBEROS) ip=/10.1.1.1 cmd=rename src=/user/alice/test dst=/user/test
perm=null
...
As you can see, the audit log shows pertinent information for each action
Alice performed. Each action required a getfileinfo command first,
followed by the action itself (listStatus, create, setPermission, and
rename). The log makes clear which user each event was for, when it
occurred, and the IP address the action was performed from, among other
details. Just as importantly, the log recorded that Alice's last attempted
action, moving the file out of her home directory into a location she did not
have permissions for, was not allowed.
MapReduce follows a very similar approach to auditing in that it contains
two audit logs with very similar purposes as the HDFS audit logs. The first
logfile, mapred-audit.log, is used to audit user activity such as job
submissions. The second logfile, SecurityAuth-mapred.audit, is used to
audit service-level authorization activity just like the HDFS log equivalent.
The log4j properties need to be set for these files. The hooks used to set
these up are log4j.category.SecurityLogger and
log4j.logger.org.apache.hadoop.mapred.AuditLogger, and
Example 8-3 shows how to do it.
Example 8-3. MapReduce log4j.properties
# other logging settings omitted
hadoop.security.logger=INFO,RFAS
hadoop.security.log.maxfilesize=256MB
hadoop.security.log.maxbackupindex=20
log4j.category.SecurityLogger=${hadoop.security.logger}
log4j.additivity.SecurityLogger=false
hadoop.security.log.file=SecurityAuth-${user.name}.audit
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${log.dir}/${hadoop.security.log.file}
log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.RFAS.MaxFileSize=${hadoop.security.log.maxfilesize}
log4j.appender.RFAS.MaxBackupIndex=${hadoop.security.log.maxbackupindex}
mapred.audit.logger=${log.threshold},RFAAUDIT
mapred.audit.log.maxfilesize=256MB
mapred.audit.log.maxbackupindex=20
log4j.logger.org.apache.hadoop.mapred.AuditLogger=${mapred.audit.logger}
log4j.additivity.org.apache.hadoop.mapred.AuditLogger=false
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${log.dir}/mapred-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.RFAAUDIT.MaxFileSize=${mapred.audit.log.maxfilesize}
log4j.appender.RFAAUDIT.MaxBackupIndex=${mapred.audit.log.maxbackupindex}
For this example, let’s assume the following:
The user Bob is identified by the Kerberos principal bob@EXAMPLE.COM,
and he has already successfully used kinit to receive a valid TGT
MapReduce service-level authorizations are not being used
Bob submits a MapReduce job
Bob kills the MapReduce job before it finishes
The results of these actions, according to the logs, are shown in Examples 8-4
and 8-5.
Example 8-4. mapred-audit.log
...
2014-03-12 18:11:46,363 INFO mapred.AuditLogger: USER=bob IP=10.1.1.1
OPERATION=SUBMIT_JOB TARGET=job_201403112320_0001 RESULT=SUCCESS
...
Example 8-5. SecurityAuth-mapred.audit
...
2014-03-12 18:46:25,200 INFO SecurityLogger.org.apache.hadoop.ipc.Server:
Auth successful for bob@EXAMPLE.COM (auth:SIMPLE)
2014-03-12 18:46:25,239 INFO SecurityLogger.org.apache.hadoop.security.
authorize.ServiceAuthorizationManager: Authorization successful for
bob@EXAMPLE.COM (auth:KERBEROS) for protocol=interface
org.apache.hadoop.mapred.JobSubmissionProtocol
2014-03-12 18:46:29,955 INFO SecurityLogger.org.apache.hadoop.ipc.Server:
Auth successful for job_201403112320_0002 (auth:SIMPLE)
2014-03-12 18:46:29,976 INFO SecurityLogger.org.apache.hadoop.security.
authorize.ServiceAuthorizationManager: Authorization successful for
job_201403112320_0002 (auth:TOKEN) for protocol=interface
org.apache.hadoop.mapred.TaskUmbilicalProtocol
...(more)...
2014-03-12 18:47:11,598 INFO SecurityLogger.org.apache.hadoop.ipc.Server:
Auth successful for bob@EXAMPLE.COM (auth:SIMPLE)
2014-03-12 18:47:11,638 INFO SecurityLogger.org.apache.hadoop.security.
authorize.ServiceAuthorizationManager: Authorization successful for
bob@EXAMPLE.COM (auth:KERBEROS) for protocol=interface
org.apache.hadoop.mapred.JobSubmissionProtocol
...
Example 8-4 is pretty straightforward: the user Bob performed the operation
SUBMIT_JOB, which resulted in a MapReduce job ID of
job_201403112320_0001. Other pertinent info, as one would expect, is the
date and time of the event and the IP address. In Example 8-5, things look a
little different. The first entry shows that Bob successfully authenticated to
the JobTracker, whereas the second entry shows that Bob was authorized
to submit the job. The next two events (and the subsequent identical events)
show the job itself authenticating and being authorized, using a delegation
token, for the TaskUmbilicalProtocol.
YARN audit log events are interspersed among the daemon logfiles.
However, they are easily identifiable because the class name is logged in the
event. For the Resource Manager, it is
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger;
and for the Node Manager, it is
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger. These
class names can be used to parse out audit events among normal application
log events. For YARN to log audit events, the log4j properties need to be
set. The hook to set this up is log4j.category.SecurityLogger, and
Example 8-6 shows how to do it.
Example 8-6. YARN log4j.properties
# other logging settings omitted
hadoop.security.logger=INFO,RFAS
hadoop.security.log.maxfilesize=256MB
hadoop.security.log.maxbackupindex=20
log4j.category.SecurityLogger=${hadoop.security.logger}
log4j.additivity.SecurityLogger=false
hadoop.security.log.file=SecurityAuth-${user.name}.audit
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${log.dir}/${hadoop.security.log.file}
log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.RFAS.MaxFileSize=${hadoop.security.log.maxfilesize}
log4j.appender.RFAS.MaxBackupIndex=${hadoop.security.log.maxbackupindex}
For this example, the user Alice submits a MapReduce job via YARN, which
then runs to completion. Example 8-7 shows just the audit events for the
Resource Manager and Example 8-8 shows the audit events for one of the
NodeManagers. Note that the repeating auditing class names have been
omitted for brevity, and the events have been formatted for readability.
Example 8-7. YARN Resource Manager Audit Events
2014-12-27 12:49:35,182 INFO USER=alice IP=10.6.9.73
OPERATION=Submit Application Request
TARGET=ClientRMService RESULT=SUCCESS
APPID=application_1419453547005_0001
2014-12-27 12:49:43,598 INFO USER=alice OPERATION=AM Allocated Container
TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000001
2014-12-27 12:49:57,288 INFO USER=alice IP=10.6.9.75 OPERATION=Register App
Master
TARGET=ApplicationMasterService RESULT=SUCCESS
APPID=application_1419453547005_0001
APPATTEMPTID=appattempt_1419453547005_0001_000001
2014-12-27 12:50:02,375 INFO USER=alice OPERATION=AM Allocated Container
TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000002
2014-12-27 12:50:02,376 INFO USER=alice OPERATION=AM Allocated Container
TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000003
2014-12-27 12:50:19,361 INFO USER=alice OPERATION=AM Released Container
TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000002
2014-12-27 12:50:21,436 INFO USER=alice OPERATION=AM Released Container
TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000003
2014-12-27 12:50:27,954 INFO USER=alice OPERATION=AM Released Container
TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000001
2014-12-27 12:50:27,963 INFO USER=alice OPERATION=Application Finished -
Succeeded
TARGET=RMAppManager RESULT=SUCCESS APPID=application_1419453547005_0001
Example 8-8. YARN Node Manager Audit Events
2014-12-27 12:49:43,956 INFO USER=alice IP=10.6.9.75
OPERATION=Start Container Request TARGET=ContainerManageImpl
RESULT=SUCCESS
APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000001
2014-12-27 12:50:27,105 INFO USER=alice OPERATION=Container Finished -
Succeeded
TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000001
2014-12-27 12:50:27,984 INFO USER=alice IP=10.6.9.75 OPERATION=Stop
Container Request
TARGET=ContainerManageImpl RESULT=SUCCESS
APPID=application_1419453547005_0001
CONTAINERID=container_1419453547005_0001_01_000001
One of the many benefits of YARN is the ability to specify resource pools.
As we saw earlier, resource pools can have authorization controls set up
such that only certain users and groups can submit to a given pool. In the next
example, Bob tries to submit to the prod resource pool, but he does not have
authorization to do so. Example 8-9 shows what the audit events look like in
this case. Again, the audit logger class name has been removed for brevity
and the log has been formatted for readability.
Example 8-9. YARN Resource Manager Audit Events
2014-12-27 13:56:35,886 INFO USER=bob IP=10.6.9.73
OPERATION=Submit Application Request TARGET=ClientRMService
RESULT=SUCCESS APPID=application_1419705820412_0002
2014-12-27 13:56:35,917 WARN USER=bob OPERATION=Application Finished -
Failed
TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state:
FAILED
PERMISSIONS=User bob cannot submit applications to queue root.prod
APPID=application_1419705820412_0002
Hive auditing is similar to YARN in that it does not have a dedicated audit
logfile. Audit events occur inside the actual Hive Metastore service log, so it
can be a bit of a challenge for a security administrator to pick out just the
pertinent audit information among regular application log events. As with
YARN, however, the audit logger class names can be used to identify audit
events. Other Hive components, such as HiveServer2, do not have explicit
auditing, but audit-like information can still be gleaned from the service logs.
For this example, let’s assume:
The user Bob is identified by the Kerberos principal bob@EXAMPLE.COM,
and he has already successfully used kinit to receive a valid TGT
Bob is using the beeline CLI to connect to HiveServer2
Bob first executes show tables; to list the tables in the default database
Bob then executes select count(*) from sample_07; to count the
number of records in the sample_07 table
The result of these actions is shown in Example 8-10, which has been
formatted for readability.
Example 8-10. Hive Metastore audit events
...
2014-03-29 17:13:18,778 INFO
org.apache.hadoop.hive.metastore.HiveMetaStore.audit:
ugi=bob ip=/10.1.1.1 cmd=get_database: default
...
2014-03-29 17:13:18,782 INFO
org.apache.hadoop.hive.metastore.HiveMetaStore.audit:
ugi=bob ip=/10.1.1.1 cmd=get_tables: db=default pat=.*
...
2014-03-29 17:13:37,110 INFO
org.apache.hadoop.hive.metastore.HiveMetaStore.audit:
ugi=bob ip=/10.1.1.1 cmd=get_table : db=default tbl=sample_07
Reviewing the audit events in Example 8-10 shows several things. First, the
audit events themselves are tagged with
org.apache.hadoop.hive.metastore.HiveMetaStore.audit. This
makes it a little easier to search the log specifically for audit events. Next,
you will notice a slight difference between these audit events and those we
have seen previously with regard to user identification: with Hive, only
the username is shown instead of the full Kerberos UPN. In each audit event,
the action performed by the user is identified by the cmd field. As you can
see, the show tables; query generates two audit events: get_database and
get_tables. The actual SQL query to count rows generates a single audit
event, which is for get_table. As with previous audit events in other
components, the IP address of the user executing the action is given.
Impala audit events are logged into dedicated audit logs used by each Impala
daemon (impalad). The audit log directory location is specified using the flag
audit_event_log_dir; a typical choice is the directory
/var/log/impalad/audits. These logfiles are rolled after they reach a certain
"size," measured in lines rather than bytes, as specified using the flag
max_audit_event_log_file_size. A reasonable setting is 5,000 lines.
For the Impala example, we will assume that the exact same assumptions are
made as the Hive example. The results of these actions are shown in
Example 8-11.
Example 8-11. Impala daemon audit log
....
{"1396114935263":{"query_id":"914b9eb1591546f0:ff4419eab4de439c",
"session_id":"e643b5e102f653ec:94e0a3d4b3646ca3",
"start_time":"2014-03-29 17:42:15.201945000","authorization_failure":false,
"status":"","user":"bob","impersonator":null,"statement_type":"SHOW_TABLES",
"network_address":"::ffff:10.1.1.1:47569","sql_statement":"show tables",
"catalog_objects":[]}}
{"1396115148996":{"query_id":"97443eddd3c172fd:34fe3f37c84d6ea8",
"session_id":"e643b5e102f653ec:94e0a3d4b3646ca3",
"start_time":"2014-03-29 17:45:48.850540000","authorization_failure":false,
"status":"","user":"bob","impersonator":null,"statement_type":"QUERY",
"network_address":"::ffff:10.1.1.1:47569","sql_statement":
"select count(*) from sample_07","catalog_objects":
[{"name":"default.sample_07","object_type":"TABLE","privilege":"SELECT"}]}}
....
Reviewing Example 8-11 immediately shows that the audit events are in a
very different format than those of other Hadoop components. They are
logged as JSON, which makes them a little harder for humans to read but
allows for easy consumption by external tools. The first
audit event shows the type of action taken by the user under the
statement_type field, namely SHOW_TABLES. This information is also
available in the sql_statement field, which shows the exact query that Bob
made. The second audit event shows the type of action taken as QUERY.
HBase logs audit events into a separate logfile, which can be configured in
the associated log4j.properties file. HBase architecture is such that clients
contact only the specific server that is responsible for the specific action
taken, so audit events are spread out throughout an HBase cluster. For
example, creating, deleting, and modifying tables is an action that the HBase
Master is responsible for. Operations such as scans, puts, and gets are
specific to a given region in a table, thus a RegionServer captures these
events.
For HBase to log audit events, the log4j properties need to be set. The hook
to set this up is the log4j.logger.SecurityLogger and Example 8-12
shows how to do it.
Example 8-12. HBase log4j.properties
# other logging settings omitted
log4j.logger.SecurityLogger=TRACE, RFAS
log4j.additivity.SecurityLogger=false
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${log.dir}/audit/SecurityAuth-hbase.audit
log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.RFAS.MaxFileSize=${max.log.file.size}
log4j.appender.RFAS.MaxBackupIndex=${max.log.file.backup.index}
For our example here, the following actions are performed:
The HBase superuser creates a table called sample
The HBase superuser grants RW access to user Alice on the sample table
The HBase superuser grants R access to user Bob on the sample table
Alice tries to create a new table called sample2, but is denied access
Alice puts a value into the sample table
Alice scans the sample table
Bob scans the sample table
Bob tries to put a value into the sample table, but is denied access
As with the log events in other components, HBase audit events can be
narrowed down to a specific class, namely
SecurityLogger.org.apache.hadoop.hbase.security.access.AccessController.
This class is repeated throughout the logs, but is omitted in Examples 8-13
and 8-14 for brevity. Also, these examples have been formatted for readability.
Example 8-13. HBase master audit log
2014-12-27 21:05:56,938 TRACE Access allowed for user hbase; reason:
Global check allowed; remote address: /10.6.9.74; request: createTable;
context: (user=hbase@EXAMPLE.COM, scope=sample, family=cf, action=CREATE)
2014-12-27 21:06:09,484 TRACE Access allowed for user hbase; reason:
Table permission granted; remote address: /10.6.9.74;
request: getTableDescriptors; context: (user=hbase@EXAMPLE.COM,
scope=sample, family=, action=ADMIN)
2014-12-27 21:06:16,620 TRACE Access allowed for user hbase; reason:
Table permission granted; remote address: /10.6.9.74;
request: getTableDescriptors; context: (user=hbase@EXAMPLE.COM,
scope=sample, family=, action=ADMIN)
2014-12-27 21:07:02,102 TRACE Access denied for user alice; reason:
Global check failed; remote address: /10.6.9.74; request:
createTable; context: (user=alice@EXAMPLE.COM, scope=sample2,
family=cf, action=CREATE)
What we can see in Example 8-13 is that table creation events are clearly
logged. The hbase user is allowed access, whereas Alice is denied access to
create a table. What is less obvious in this logfile is the actual granting of
permissions. While the action is logged as an ADMIN action with the same
scope as the sample table, there is no indication of which user the table
permissions were granted to. This is a limitation in HBase that will likely be
improved in a future release.
Example 8-14. HBase region server audit log
In Example 8-14, the read and write actions attempted by Alice and Bob
are clearly identified. The log provides the pertinent information about the
table, column family, and column, as well as the reason why each action was
allowed or denied.
Accumulo Audit Logs
Similar to HBase, Accumulo can be configured to log audit events to a
separate logfile. Because Accumulo clients aren't required to communicate
with a single, central server for every access, audit logs are spread
throughout the cluster. For example, when you create, delete, or modify a
table, that action will be logged by the Accumulo Master, whereas operations
such as scans and writes are logged by the TabletServer handling the request.
The default Accumulo configuration templates have audit logging turned off.
You can turn on audit logging by setting the log level of the Audit logger to INFO
in the auditLog.xml log4j configuration file. Example 8-15 shows a sample
auditLog.xml configuration file with audit logging turned on.
Example 8-15. Accumulo auditLog.xml
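The original listing is not reproduced here. A minimal sketch in standard log4j XML, assuming the logger and appender are both named Audit (the file path and pattern layout are illustrative), might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- Rolling file appender for audit events; the path is illustrative -->
  <appender name="Audit" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="File" value="/var/log/accumulo/audit.log"/>
    <param name="DatePattern" value=".yyyy-MM-dd"/>
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss,SSS/Z} [%c{2}] %-5p: %m%n"/>
    </layout>
  </appender>
  <!-- Setting this level to INFO (instead of OFF) turns audit logging on -->
  <logger name="Audit" additivity="false">
    <level value="INFO"/>
    <appender-ref ref="Audit"/>
  </logger>
</log4j:configuration>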
Accumulo audits both system administration actions and normal user access.
Every audit includes operation status (success, failure, permitted, or denied)
and the user performing the action. Remote requests also include the client
address. Failed requests log the exception that caused the failure. Individual
actions differ in the details they provide, but generally they include details
such as the target or targets of an action and relevant parameters such as the
range of rows and columns accessed. See Table 8-1 for a list of the actions
that Accumulo logs.
Table 8-1. Accumulo's audited actions
Action | Description
authenticate | A user authenticates with Accumulo
createUser | An admin creates a new user
dropUser | An admin drops a user
changePassword | An admin changes a user's password
changeAuthorizations | An admin changes a user's authorizations
grantSystemPermission | An admin grants system permissions to a user
grantTablePermission | An admin grants permissions to a user on a table
revokeSystemPermission | An admin revokes system permissions from a user
revokeTablePermission | An admin revokes permissions from a user on a table
createTable | A user creates a table
deleteTable | A user deletes a table
renameTable | A user renames a table
cloneTable | A user clones a table
scan | A user scans a range of rows
deleteData | A user deletes data from a table
bulkImport | A user initiates a bulk import of data
export | A user exports a table from one cluster to another
import | A user imports an exported table
Now let’s see what the audit logs will look like after some actions are
performed. Our examples will include the results after running the following
actions:
The Accumulo root user creates a user called alice
The Accumulo root user creates a user called bob
The Accumulo root user creates a table called sample
The Accumulo root user grants Table.READ and Table.WRITE access to
user alice on the sample table
The Accumulo root user grants Table.READ access to user bob on the
sample table
Alice tries to create a new table called sample2, but is denied access
Alice puts a value into the sample table
Alice scans the sample table
Bob scans the sample table
Bob tries to put a value into the sample table, but is denied access
The audit logs shown in Examples 8-16 and 8-17 have been formatted for
readability but are otherwise unmodified.
Example 8-16. Accumulo master audit log
2014-12-27 16:40:11,673/-0800 [Audit] INFO : operation: permitted;
user: root; action: createTable; targetTable: sample;
2014-12-27 16:40:28,563/-0800 [Audit] INFO : operation: denied;
user: alice; action: createTable; targetTable: sample2;
In Example 8-16, we can see that the table creation operations are clearly
logged. The root user is permitted to perform the createTable action while
alice is denied. The other administrative actions appear in the
TabletServer log.
Example 8-17. Accumulo TabletServer audit log
2014-12-27 16:39:49,262/-0800 [Audit] INFO : operation: success;
user: root: action: createUser; targetUser: alice; Authorizations: ;
2014-12-27 16:40:02,226/-0800 [Audit] INFO : operation: success;
user: root: action: createUser; targetUser: bob; Authorizations: ;
2014-12-27 16:40:13,226/-0800 [Audit] INFO : operation: success;
user: root: action: grantTablePermission; permission: READ;
targetTable: sample; targetUser: alice;
2014-12-27 16:40:13,292/-0800 [Audit] INFO : operation: success;
user: root: action: grantTablePermission; permission: WRITE;
targetTable: sample; targetUser: alice;
2014-12-27 16:40:13,442/-0800 [Audit] INFO : operation: success;
user: root: action: grantTablePermission; permission: READ;
targetTable: sample; targetUser: bob;
2014-12-27 16:40:30,529/-0800 [Audit] INFO : operation: permitted;
user: alice; action: scan; targetTable: sample; authorizations: ;
range: (-inf,+inf); columns: []; iterators: []; iteratorOptions: {};
2014-12-27 16:40:43,180/-0800 [Audit] INFO : operation: permitted;
user: bob; action: scan; targetTable: sample; authorizations: ;
range: (-inf,+inf); columns: []; iterators: []; iteratorOptions: {};
Example 8-17 shows the output of the TabletServer audit log. We can see
the createUser actions, the user that was created, and the authorizations that
were assigned to that user. We can also see the grantTablePermission
actions along with the permission granted, the target table, and the target user.
Finally, we can see that the two scan actions include the details of the
query: the row range, columns, and iterators used. Notably missing are the
write operations; this is a current gap in Accumulo's auditing framework.
We also don't see the authentication events, because they are logged by the
shell itself.
In Chapter 7, we saw that the latest version of Sentry uses a service to
facilitate authorization requests and manage interaction with the policy
database. Audit events that result from modifying authorization policies
are extremely important to the accounting process, so Sentry needs to be
configured to capture them. The
sentry.hive.authorization.ddl.logger logger is the one that
needs to be configured. Example 8-18 shows how this can be done.
Example 8-18. Sentry server log4j.properties
# other log settings omitted
log4j.logger.sentry.hive.authorization.ddl.logger=${sentry.audit.logger}
log4j.additivity.sentry.hive.authorization.ddl.logger=false
sentry.audit.logger=TRACE,RFAAUDIT
sentry.audit.log.maxfilesize=256MB
sentry.audit.log.maxbackupindex=20
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${log.dir}/audit/sentry-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.RFAAUDIT.MaxFileSize=${sentry.audit.log.maxfilesize}
log4j.appender.RFAAUDIT.MaxBackupIndex=${sentry.audit.log.maxbackupindex}
Now that Sentry is set up to log audit events, let’s look at an example. For
this example, Alice is a Sentry administrator and Bob is not. Alice uses the
beeline shell to create a new role called analyst, assign the role to the
group analystgrp, and grant SELECT privileges on the default database to
the role. Next, Bob tries to create a new role using the impala-shell, but is
denied access. Example 8-19 shows the record of these actions.
Example 8-19. Sentry server audit log
2015-01-02 11:17:10,753 INFO ddl.logger:
{"serviceName":"Sentry-Service","userName":"alice","impersonator":
"hive/server1.example.com@EXAMPLE.COM","ipAddress":"/10.6.9.74",
"operation":"CREATE_ROLE","eventTime":"1420215430742","operationText":
"CREATE ROLE analyst","allowed":"true","databaseName":null,
"tableName":null,"resourcePath":null,"objectType":"ROLE"}
2015-01-02 11:17:37,537 INFO ddl.logger:
{"serviceName":"Sentry-Service","userName":"alice","impersonator":
"hive/server1.example.com@EXAMPLE.COM","ipAddress":"/10.6.9.74",
"operation":"ADD_ROLE_TO_GROUP","eventTime":"1420215457536",
"operationText":"GRANT ROLE analyst TO GROUP analystgrp","allowed":"true",
"databaseName":null,"tableName":null,"resourcePath":null,"objectType":"ROLE"
}
2015-01-02 11:17:52,408 INFO ddl.logger:
{"serviceName":"Sentry-Service","userName":"alice","impersonator":
"hive/server1.example.com@EXAMPLE.COM","ipAddress":"/10.6.9.74",
"operation":"GRANT_PRIVILEGE","eventTime":"1420215472407","operationText":
"GRANT SELECT ON DATABASE default TO ROLE analyst","allowed":"true",
"databaseName":"default","tableName":"","resourcePath":"",
"objectType":"PRINCIPAL"}
2015-01-02 11:33:20,199 INFO ddl.logger:
{"serviceName":"Sentry-Service","userName":"bob","impersonator":
"impala/server1.example.com@EXAMPLE.COM","ipAddress":"/10.6.9.73",
"operation":"CREATE_ROLE","eventTime":"1420216400199","operationText":
"CREATE ROLE temp","allowed":"false","databaseName":null,"tableName":null,
"resourcePath":null,"objectType":"ROLE"}
General-purpose log aggregation systems already in place in the enterprise can be a
great way to manage Hadoop audit logs.
Another interesting option for log aggregation is to ingest the audit logs back into the
Hadoop cluster for analysis. Security analytics is a common use case for Hadoop, and
analyzing Hadoop's own audit events fits the bill as well. As shown in this
chapter, audit events are generally structured, which makes for easy
querying using SQL tools like Hive or Impala.
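As a sketch of what that might look like, assume the hdfs-audit.log files have been copied into an HDFS directory /data/audit/hdfs (a hypothetical location) and split each line naively on spaces; a real pipeline would parse the key=value pairs properly. The following HiveQL supports simple questions such as which users generate the most denied actions:

-- Each raw log line: date time level category allowed=... ugi=... (auth:...) ip=... cmd=... src=... dst=... perm=...
CREATE EXTERNAL TABLE hdfs_audit (
  event_date STRING, event_time STRING, level STRING, category STRING,
  allowed STRING, ugi STRING, auth STRING, ip STRING, cmd STRING,
  src STRING, dst STRING, perm STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/audit/hdfs';

-- Denied actions per user and command; tokens are kept raw (e.g., 'allowed=false')
SELECT ugi, cmd, COUNT(*) AS denied_events
FROM hdfs_audit
WHERE allowed = 'allowed=false'
GROUP BY ugi, cmd
ORDER BY denied_events DESC;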
In this chapter, we took a look at several of the components in the Hadoop
ecosystem and described the types of audit events that are recorded when
users interact with the cluster. These log events are critical for accounting,
both to ascertain what regular users are doing and to discover what
unauthorized users are attempting to do. Although the Hadoop ecosystem
does not have native alerting capabilities, the structure of the log events is
conducive to consumption by more general-purpose tools. Active alerting is
a newer capability that is still being worked on in the Hadoop ecosystem; in
the meantime, many general-purpose log aggregation tools common in the
enterprise can alert when certain criteria are met.
By Eddie Garcia
So far, we have covered how Hadoop can be configured to enforce standard
AAA controls. In this chapter, we will understand how these controls, along
with the CIA principles discussed in Chapter 1, provide the foundation for
protecting data. Data protection is a broad concept that involves topics
ranging from data privacy to acceptable use. One of the topics we will
specifically focus on is encryption.
Encryption is a common method to protect data. There are two primary
flavors of data encryption: data-at-rest encryption and data-in-transit
encryption, the latter also referred to as over-the-wire encryption. Data at rest refers
to data that is stored even after machines are powered off. This includes data
on hard drives, flash drives, USB sticks, memory cards, CDs, DVDs, or even
some old floppy drives or tapes in storage boxes. Data in transit, as its name
implies, is data on the move, such as data traveling on the Internet, a USB
cable, a coffee shop WiFi, cell phone towers, or from a remote space station
to Earth.
Before diving into the two flavors of data encryption, we’ll briefly discuss
encryption algorithms. Encryption algorithms define the mathematical
technique used to encrypt data. A common encryption algorithm is the
Advanced Encryption Standard, or AES. It is a specification established by
the U.S. National Institute of Standards and Technology (NIST) in FIPS-197.
Describing how AES encryption works is beyond the scope of this text, and
we recommend Chapter 4 of Understanding Cryptography by Christof Paar
and Jan Pelzl (Springer, 2010). Other common encryption algorithms include
DES, RC4, Twofish, and Blowfish.
When using AES, the commonly supported sizes are 128-bit, 192-bit, and
256-bit keys. The industry standard today is AES-256 (256-bit key)
encryption, but history has shown that this can and will change. At one point,
DES and triple DES (three rounds of DES) were the industry standard, but
single DES can now be cracked easily with brute force, and triple DES has
since been deprecated as well.
Because of the performance overhead that encryption incurs, chip vendors
added on-chip hardware functions to improve the performance of encryption.
These enhancements can yield several orders of magnitude of improvement
over software encryption. One popular hardware encryption technology is
Intel’s AES-NI.
NOTE
Over the years, there have been many cases of sensitive data breaches as a result of
laptops and cell phones misplaced during transport, improper hard drive disposal, and
physical hardware theft. Data-at-rest encryption helps mitigate these types of breaches
because encryption makes it more difficult (but not impossible) to view the data.
In addition to native HDFS encryption, we will explore three other options,
but we will not go into depth for every method because some are vendor
specific. These methods work transparently below HDFS and thus don’t
require any Hadoop-specific configuration. All of these methods protect data
in the case of a drive being physically removed from a drive array:
Encrypted drives
Full disk encryption
Filesystem encryption
Full disk encryption of the operating system volume creates a chicken-and-egg situation; the encrypted OS would need to boot to decrypt the OS.
One of the benefits of filesystem encryption is that it offers protection for
data against rogue users and processes running on the system. If an
encrypted home directory is protected by a password known to a user
and that user has not logged on to the system since boot, it would be
impossible for a rogue user or process to gain access to the key to unlock
the user’s data, even as root.
Disk and filesystem encryption also raise practical questions:
How do you configure more than one disk partition with encryption?
How can you avoid providing passwords at boot time or in clear text
scripts?
Ultimately, the hard part of large-scale at-rest encryption is key management.
Native HDFS data-at-rest encryption, as we’ll discuss in the next section,
uses a combination of collocating encrypted keys with the file metadata and
reliance on an external key server for managing key material. The other
encryption-at-rest technologies discussed also require the use of a key
management service at scale.
Picking a vendor for your key management system is complicated and we
can’t provide a recommendation for your environment. However, here are
some key criteria to consider:
Does the solution support hardware security modules?
How scalable is the solution (number of keys as well as key retrieval per
second)?
Does the solution support Hadoop standards (e.g., KeyProvider
interface)?
How easy is it to manage authorization controls for hundreds or thousands
of keys?
Starting with Hadoop 2.6, HDFS supports native encryption at rest. This
feature is not considered full disk encryption or filesystem encryption.
Rather, it is another variation typically called application-level encryption. In this
method, data is encrypted at the application layer before it is sent in transit
and before it reaches storage. This method of encryption runs above the
operating system layer and no special operating system packages or
hardware are required other than what is provided by Hadoop. For more
details on the design of native HDFS encryption beyond the description given
here, you can read the HDFS Data at Rest Encryption Design Document.
Within HDFS, directory paths that require encryption are broken down into
encryption zones. Each file in an encryption zone is encrypted with a unique
data encryption key (DEK). This is where the encryption zone distinction
matters. The plaintext DEKs are not persisted. Instead, a zone-level
encryption key, called an encryption zone key (EZK), is used to encrypt the
DEK into an encrypted DEK (EDEK). The EDEK is then persisted as an
extended attribute in the NameNode metadata for a given file.
HDFS encryption zones provide a tool for mirroring external security
domains. Take a company with multiple divisions that need to maintain some
division-only datasets. By creating an encryption zone per division, you can
protect data on a per-division basis without the overhead of keeping a unique
key per file in an authenticated keystore.
If the EDEK is stored in the HDFS metadata, where are the EZKs stored?
These keys need to be kept secure because compromising an EZK provides
access to all data stored in that encryption zone. To prevent Hadoop
administrators from having access to the EZKs, and thus the ability to decrypt
any data, the EZKs must not be stored in HDFS. EZKs need to be accessed
through a secure key server. The key server itself is a separate piece of
software that handles the storage and retrieval of EZKs. In larger enterprises,
the actual storage component is handled by a dedicated hardware security
module (HSM). With this deployment, the key server acts as the software
interface between the clients requesting keys and the backend secure storage.
In order to have a separation of duties, there needs to be an intermediary
between HDFS, HDFS clients, and the key server. This is solved with the
introduction of the Hadoop Key Management Server (KMS). The KMS
handles generating encryption keys (both EZKs and DEKs), communicating
with the key server, and decrypting EDEKs. The KMS communicates with
the key server through a Java API called the KeyProvider. The KeyProvider
implementation and configuration is covered a bit later.
To better understand what is happening, let’s take a look at the sequence of
events that happens when an HDFS client is writing to a new file that’s
stored in an encryption zone in HDFS:
The HDFS client calls create() to write to the new file.
The NameNode requests the KMS to create a new EDEK using the
EZK-id/version.
The KMS generates a new DEK.
The KMS retrieves the EZK from the key server.
The KMS encrypts the DEK, resulting in the EDEK.
The KMS provides the EDEK to the NameNode.
The NameNode persists the EDEK as an extended attribute for the file
metadata.
The NameNode provides the EDEK to the HDFS client.
The HDFS client provides the EDEK to the KMS, requesting the DEK.
The KMS requests the EZK from the key server.
The KMS decrypts the EDEK using the EZK.
The KMS provides the DEK to the HDFS client.
The HDFS client encrypts data using the DEK.
The HDFS client writes the encrypted data blocks to HDFS.
The sequence of events for reading an encrypted file is:
The HDFS client calls open() to read a file.
The NameNode provides the EDEK to the client.
The HDFS client passes the EDEK and EZK-id/version to the KMS.
The KMS requests the EZK from the key server.
The KMS decrypts the EDEK using the EZK.
The KMS provides the DEK to the HDFS client.
The HDFS client reads the encrypted data blocks, decrypting them with
the DEK.
In both the read and write sequences, HDFS authorization was not
mentioned. Authorization checks still happen before the file can be created or
opened; the encryption and decryption steps take place only after the HDFS
authorization checks succeed.
WARNING
Because the KMS plays such an important role in HDFS encryption, this component
should not be collocated on servers running other Hadoop ecosystem components, or
servers used as edge nodes for clients. There needs to be a proper security separation of
duties, and isolation between encryption key operations and other operations.
Because the communication between the KMS and both the key server and
HDFS clients involves passing encryption keys, it is absolutely paramount
that this communication also be encrypted using TLS. We will see how to do
this in the next section.
Configuration
We have covered a lot in this section about HDFS encryption, but so far we
have not discussed how any of this actually gets configured. In the core-site.xml file on each HDFS node and client node, set the following parameter:
hadoop.security.key.provider.path
The URI for the KeyProvider to use when interacting with encryption
keys as a client. Example:
kms://https@kms.example.com:16000/kms.
On the HDFS server (NameNode and DataNode) side, the following
properties are available:
dfs.encryption.key.provider.uri
The URI for the KeyProvider to use when interacting with encryption
keys used when reading and writing to an encryption zone. Example:
kms://https@kms.example.com:16000/kms.
hadoop.security.crypto.cipher.suite
Cipher suite for the crypto codec. Default: AES/CTR/NoPadding
hadoop.security.crypto.codec.classes.aes.ctr.nopadding
Comma-separated list of crypto codec implementations for
AES/CTR/NoPadding. The first implementation will be used if
available; others are fallbacks. Default:
org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec,
org.apache.hadoop.crypto.JceAesCtrCryptoCodec
hadoop.security.crypto.jce.provider
The JCE provider. Default: None
hadoop.security.crypto.buffer.size
The buffer size used by CryptoInputStream and
CryptoOutputStream. Default: 8192
As you can see, the HDFS configuration is minimal. HDFS uses sensible
defaults for the cryptography aspect of it, so the only requirement to enable
HDFS encryption is to set the first two configurations, namely
hadoop.security.key.provider.path and
dfs.encryption.key.provider.uri.
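To make this concrete, here is a minimal sketch of those two settings, using the example KMS URI from above (the hostname and port are illustrative). In core-site.xml:

<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://https@kms.example.com:16000/kms</value>
</property>

And in hdfs-site.xml:

<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://https@kms.example.com:16000/kms</value>
</property>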
In order to configure the Hadoop KMS, the configuration file kms-site.xml is
used. Configure the following properties on the Hadoop KMS node:
hadoop.kms.key.provider.uri
The URI for the EZK provider. Example:
jceks://file@/var/lib/kms/kms.keystore.
hadoop.kms.authentication.type
The authentication mechanism to use. Example: simple or kerberos
hadoop.kms.authentication.kerberos.keytab
The location of the Kerberos keytab file to use for service authentication
hadoop.kms.authentication.kerberos.principal
The SPN that the service should use for authentication. Example:
HTTP/kms.example.com@EXAMPLE.COM.
hadoop.kms.authentication.kerberos.name.rules
Kerberos auth_to_local rules to use. Example: DEFAULT
hadoop.kms.proxyuser.<user>.groups
The list of groups that <user> (e.g., hdfs, hive, oozie) is allowed to
impersonate
hadoop.kms.proxyuser.<user>.hosts
The list of hosts from which <user> (e.g., hdfs, hive, oozie) is allowed
to impersonate
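Putting a few of these together, a sketch of a Kerberos-enabled kms-site.xml using the file-based keystore discussed in the warning below (the keystore and keytab paths are illustrative) might look like:

<property>
  <name>hadoop.kms.key.provider.uri</name>
  <value>jceks://file@/var/lib/kms/kms.keystore</value>
</property>
<property>
  <name>hadoop.kms.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.kms.authentication.kerberos.keytab</name>
  <value>/etc/kms/conf/kms.keytab</value>
</property>
<property>
  <name>hadoop.kms.authentication.kerberos.principal</name>
  <value>HTTP/kms.example.com@EXAMPLE.COM</value>
</property>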
WARNING
An example is listed in the KMS configuration properties that shows the ability to use a
file-based KeyProvider. This is just a Java keystore file that stores EZKs. While this is a
quick and easy way to get up and running with HDFS encryption, it is only recommended
in POC or development environments for testing. Using a file-based KeyProvider
collocates the KMS and key server functions on the same machine, which does not offer
the desired security separation of duties or the ability to enforce additional isolation
controls. Also, the key storage is just a basic file on disk. As mentioned before, most
enterprises will want to utilize a separate service as the KeyProvider, which uses a more
secure storage for EZKs, such as what is provided by HSMs.
As you can see from the KMS configuration, strong authentication with
Kerberos is possible. This is absolutely the recommended configuration.
Non-Kerberos deployment should not be used due to the sensitivity of what
the KMS is providing. The actual KMS operates over the HTTP protocol, so
Kerberos authentication with KMS clients happens over SPNEGO. For this
reason, the Kerberos principal that the KMS uses should be of the
HTTP/kms.example.com@EXAMPLE.COM variety, which uses the HTTP
service name.
We mentioned briefly in the last section that setting up the KMS with TLS
wire encryption is important. To do this, set two environment variables for
the KeyStore and password in kms-env.sh. The KeyStore file is just a Java
KeyStore and the location of it is specified with the
KMS_SSL_KEYSTORE_FILE environment variable. If this KeyStore is
protected with a password (and it should be!), specify the password in the
KMS_SSL_KEYSTORE_PASS environment variable.
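A sketch of the relevant lines in kms-env.sh (the KeyStore path and password are illustrative):

# Java KeyStore holding the KMS TLS certificate, and its password
export KMS_SSL_KEYSTORE_FILE=/etc/hadoop-kms/conf/kms.jks
export KMS_SSL_KEYSTORE_PASS=changeit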
KMS authorization
The KMS, like other Hadoop components, has the ability to restrict access to
certain functions through the use of access control lists (ACLs). The file
kms-acls.xml stores information about which users and groups can perform which
functions with the KMS. Example 9-1 shows an example of one.
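The original listing is not reproduced here. A minimal sketch consistent with the discussion that follows, using standard Hadoop KMS ACL property names and the groups named in the text (a leading space in a value means the entry lists groups only), might look like:

<!-- Blacklist Hadoop administrators (the hdfs user and the supergroup group) -->
<property>
  <name>hadoop.kms.blacklist.CREATE</name>
  <value>hdfs supergroup</value>
</property>
<property>
  <name>hadoop.kms.blacklist.DELETE</name>
  <value>hdfs supergroup</value>
</property>
<property>
  <name>hadoop.kms.blacklist.ROLLOVER</name>
  <value>hdfs supergroup</value>
</property>
<!-- Only the infosec group may manage and read keys -->
<property>
  <name>default.key.acl.MANAGEMENT</name>
  <value> infosec</value>
</property>
<property>
  <name>default.key.acl.READ</name>
  <value> infosec</value>
</property>
<!-- Regular users may only decrypt EDEKs -->
<property>
  <name>default.key.acl.DECRYPT_EEK</name>
  <value> hadoopusers</value>
</property>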
Each of the entries in Example 9-1 has a value format of user1,user2
group1,group2, just as described in “Service-Level Authorization”.
You'll notice the usage of blacklists. In order to enforce the separation of
Hadoop administrators from the actual data, Hadoop administrators should
not have the ability to interact with and perform operations on the KMS.
Hadoop administrators that are part of the supergroup group have the ability
to traverse the entire HDFS directory tree. Administrators of the cluster
should not be able to decrypt encrypted data, so blacklisting these users is
important.
Keep in mind that the Hadoop KMS is a general-purpose key management
server. The keys it works with have no meaning or difference in how they are
handled. This means that EZKs and DEKs are equivalent from the KMS point
of view. This is why using KMS ACLs is important. For example, by default
the CREATE operation returns the actual key material. This is bad if regular
users are able to retain the actual EZK, as it can be used to decrypt EDEKs
for the entire encryption zone.
WARNING
Allowing any user to create keys opens up several potential security risks. For example, a
rogue user could easily write a script to continually create new keys until the KMS and/or
key server fails, such as running out of storage. This effectively creates a denial-of-service
scenario that prevents all encrypted data from being accessible! Use restrictive KMS
ACLs to authorize only a small set of security administrators the ability to create and
manage keys.
Following this model, Example 9-1 shows that Hadoop administrators,
namely the hdfs user and the supergroup group, are blacklisted from all the
operations that are unnecessary. Furthermore, the infosec group is the only
group allowed to perform the MANAGEMENT and READ functions. Lastly, the
hadoopusers group is allowed to perform the DECRYPT_EEK function, but
nothing else.
While Example 9-1 shows default ACLs, denoted by the prefix
default.key.acl, it is also possible to define ACLs to specific keys by
name, such as key.acl.foo.READ where foo is the name of the key. We’ll
discuss how the keynames come into the picture in the next section, which
covers HDFS encryption client operations.
Client operations
The first step in using HDFS encryption as a client is to create an encryption
zone key. To do this, use the hadoop key command, which outputs details of the key
created and the KMS that performed the request:
[bob@server1 ~]$ hadoop key -create myzonekey
myzonekey has been successfully created with options
Options{cipher='AES/CTR/NoPadding', bitLength=128, description='null',
attributes=null}.
KMSClientProvider[https://kms.example.com:16000/kms/v1/] has been
updated.
[bob@server1 ~]$
Now we can create a new encryption zone in HDFS. To do this, use the hdfs
crypto command:
[bob@server1 ~]$ hdfs dfs -mkdir /myzone
[bob@server1 ~]$ hdfs crypto -createZone -keyName myzonekey -path /myzone
[bob@server1 ~]$ hdfs crypto -listZones
/myzone myzonekey
[bob@server1 ~]$
From here, HDFS clients can read and write files in the /myzone directory
and have them be transparently encrypted or decrypted.
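For example, a quick round trip through the zone (a sketch, assuming bob is authorized to decrypt EDEKs for myzonekey and to write to /myzone) looks like any other HDFS operation:

[bob@server1 ~]$ echo "sensitive data" | hdfs dfs -put - /myzone/data.txt
[bob@server1 ~]$ hdfs dfs -cat /myzone/data.txt
sensitive data
[bob@server1 ~]$ hdfs crypto -getFileEncryptionInfo -path /myzone/data.txt

The last command prints the cipher suite and EDEK information recorded in the file's extended attributes, confirming that the file sits in an encryption zone.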
TIP
MapReduce2 intermediate data encryption comes with several caveats:
Intermediate data encryption is enabled on a per-job basis (client configuration)
Users might not know that the source data came from an encryption zone
Users might not enable intermediate data encryption properly
Users might disable intermediate data encryption because of performance impacts
Intermediate data encryption is only available for MR2, not MR1
The job configuration properties shown in Table 9-1 are used to enable
intermediate data encryption.
Table 9-1. Intermediate data encryption properties
Property | Description
mapreduce.job.encrypted-intermediate-data | Set to true to enable (default: false)
mapreduce.job.encrypted-intermediate-data-key-size-bits | The key length for encryption, in bits (default: 128)
mapreduce.job.encrypted-intermediate-data.buffer.kb | The buffer size to use, in KB (default: 128)
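Because these are per-job client settings, they can be passed at submission time. A sketch (the jar, class, and paths are illustrative, and assume the job uses GenericOptionsParser so that -D options are honored):

[alice@server1 ~]$ hadoop jar wordcount.jar WordCount \
  -Dmapreduce.job.encrypted-intermediate-data=true \
  -Dmapreduce.job.encrypted-intermediate-data-key-size-bits=256 \
  /myzone/input /myzone/output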
It is certainly desirable to have intermediate data encryption that is both
enforced and enabled only when actually necessary. We hope that in a later
Hadoop release this implementation will improve such that MapReduce tasks
automatically encrypt intermediate files when they detect that the source
data came from an encryption zone, and that clients will not be able to
override this behavior.
To configure Impala daemons to protect data it spills to disk, the following
startup flags are needed:
disk_spill_encryption
Set this to true to turn on the encryption of all data spilled to disk during
a query. Default: false. When data is about to be spilled to disk, it is
encrypted with a randomly generated AES 256-bit key. When read back
from disk, it’s decrypted.
disk_spill_integrity
Set this to true to turn on an integrity check of all data spilled to disk
during a query. Default: false. When data is about to be spilled to disk,
a SHA256 hash of the data is taken. When read back in from disk, a
SHA256 hash is again taken and compared to the original. This detects
tampering with data spilled to disk.
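A sketch of how these flags might be passed when starting the daemon manually (most deployments set them through their service management tooling instead):

[root@server1 ~]# impalad --disk_spill_encryption=true --disk_spill_integrity=true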
If you're using a version of HDFS that doesn't support native encryption, or if
you need to encrypt data used by other Hadoop ecosystem components,
then you might want to consider full disk encryption or filesystem encryption.
Let's take a look at full disk encryption using the Linux Unified Key Setup
(LUKS). There are several products for full disk encryption; we will focus on
LUKS because it is a common open source tool for enabling full disk
encryption on Linux.
WARNING
Data encryption is not something you want to experiment with in production or on real
data. A mistake could cause your data to be permanently unrecoverable.
Most LUKS implementations use cryptsetup and dm-crypt found in Linux
distributions:
cryptsetup provides the user space tools to create, configure, and
administer the encrypted volumes
dm-crypt provides the Linux kernel space logic to encrypt the block
device
In Example 9-2, we show how to configure LUKS on a device using the
command line. Some Linux distributions have tools that allow for simple
configuration during OS installation. This can be as easy as checking a box
or selecting an option to enable full-disk encryption when setting up storage
drives, adding additional drives, or re-partitioning existing drives. This
hides all the complexity of using cryptsetup and dm-crypt. We encourage
you to use the distribution-provided tools when possible.
WARNING
When you set up LUKS on a device, data on the device is overwritten. If you’re setting up
LUKS on a device that already has data, first make a backup of the entire device and then
restore the data after LUKS is configured. Exercise caution when performing the LUKS
configuration.
Example 9-2. LUKS encryption
Install cryptsetup.
On CentOS/RHEL:
[root@hadoop01 ~]# yum install cryptsetup-luks
On Debian/Ubuntu:
[root@hadoop01 ~]# apt-get install cryptsetup
Set up the LUKS storage device.
[root@hadoop01 ~]# cryptsetup -y -v luksFormat /dev/xvdc
WARNING!
========
This will overwrite data on /dev/xvdc irrevocably.
Are you sure? (Type uppercase yes): YES
Enter LUKS passphrase:
Verify passphrase:
Command successful.
Open the device and map it to a new device:
[root@hadoop01 ~]# cryptsetup luksOpen /dev/xvdc data1
This creates a new mapping device on /dev/mapper/data1.
Clear all the data on the device (this is mainly to clear the header, but
it’s a good security practice to clear it all):
[root@hadoop01 ~]# dd if=/dev/zero of=/dev/mapper/data1
NOTE
The preceding operation is writing zeros over the entire storage device so it can
take minutes to hours to complete, depending on the size of the device and the
speed of your system.
When the dd command completes, you can create your filesystem; in
this case, we will use ext4, but you can also use XFS or your desired
filesystem format:
[root@hadoop01 ~]# mkfs.ext4 /dev/mapper/data1
Now that you have an encrypted device with a filesystem, you can
mount it like a regular filesystem:
[root@hadoop01 ~]# mkdir /data/dfs/data1
[root@hadoop01 ~]# mount /dev/mapper/data1 /data/dfs/data1
[root@hadoop01 ~]# df -H
[root@hadoop01 ~]# ls -l /data/dfs/data1
Repeat the previous steps for your other drives mounted on
/data/dfs/data[2-N] and then install Hadoop using /data/dfs/data[1-N]
for HDFS storage.
Next we look at filesystem encryption using eCryptfs. eCryptfs has two main components, ecryptfs-utils and ecryptfs:
ecryptfs-utils provides the user space tools to create, configure, and
administer the encrypted directories
ecryptfs provides the Linux kernel space logic to layer the encrypted
filesystem over the directories on the existing filesystem
In Example 9-3, we show you how to configure eCryptfs using the command
line. Some Linux distributions have tools that allow for simple configuration
during OS install. This can be as easy as checking a box or selecting an
option to set up storage drives, add additional drives, or repartition existing
drives. This hides all the complexity of using ecryptfs-utils and
ecryptfs for you. We encourage you to use the distribution-provided tools
when possible.
Example 9-3. eCryptfs encryption
Install ecryptfs-utils.
On CentOS/RHEL:
[root@hadoop01 ~]# yum install ecryptfs-utils
On Debian/Ubuntu:
[root@hadoop01 ~]# apt-get install ecryptfs-utils
Mount a new encrypted filesystem over your empty HDFS data
directory.
[root@hadoop01 ~]# mount -t ecryptfs /data/dfs/data1 /data/dfs/data1
Select key type to use for newly created files:
passphrase
tspi
openssl
Selection: 1
Passphrase:
Select cipher:
aes: blocksize = 16; min keysize = 16; max keysize = 32 (not
loaded)
blowfish: blocksize = 16; min keysize = 16; max keysize = 56 (not
loaded)
des3_ede: blocksize = 8; min keysize = 24; max keysize = 24 (not
loaded)
twofish: blocksize = 16; min keysize = 16; max keysize = 32 (not
loaded)
cast6: blocksize = 16; min keysize = 16; max keysize = 32 (not
loaded)
cast5: blocksize = 8; min keysize = 5; max keysize = 16 (not
loaded)
Selection [aes]: aes
Select key bytes:
1) 16
2) 32
3) 24
Selection [16]: 32
Enable plaintext passthrough (y/n) [n]: n
Enable filename encryption (y/n) [n]: n
Attempting to mount with the following options:
ecryptfs_unlink_sigs
ecryptfs_key_bytes=32
ecryptfs_cipher=aes
ecryptfs_sig= 9808e34a098f3814
WARNING: Based on the contents of [/root/.ecryptfs/sig-cache.txt],
it looks like you have never mounted with this key
before. This could mean that you have typed your
passphrase wrong. Would you like to proceed with the mount (yes/no)?
: yes
Would you like to append sig [9808e34a098f3814] to
[/root/.ecryptfs/sig-cache.txt]
in order to avoid this warning in the future (yes/no)? : yes
Successfully appended new sig to user sig cache file
Mounted eCryptfs
NOTE
During the mount command, you’ll be prompted for the size of the key in bytes.
Previously, we described the desired key size as 256 bits. Because there are 8 bits
in a byte, we will select a 32-byte key.
Repeat the preceding steps for your other drives mounted on
/data/dfs/data[2-N] and then install Hadoop using /data/dfs/data[1-N]
for HDFS storage.
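If you are scripting this setup across many drives, the interactive prompts can be avoided by passing the same choices as mount options. This is only a sketch, assuming the passphrase key type and the AES/32-byte selections shown above (you will still be prompted for the passphrase itself); check the eCryptfs documentation for your version before relying on it:
[root@hadoop01 ~]# mount -t ecryptfs /data/dfs/data1 /data/dfs/data1 \
  -o key=passphrase,ecryptfs_cipher=aes,ecryptfs_key_bytes=32,\
ecryptfs_passthrough=n,ecryptfs_enable_filename_crypto=n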
Important Data Security Consideration for Hadoop
If you are configuring encryption for data at rest for Hadoop, be aware that
sensitive data may land not only in HDFS, but also in other areas such
as shuffles, spill files, temporary files, logfiles, swap files, indexes, and
metadata stores that run on MySQL, PostgreSQL, SQLite, Oracle, or Derby.
In Chapter 10, we will cover some of the areas where Hadoop offers
encryption for those other data sets outside of HDFS.
Being a clever kid, Alice came up with her own alphabet with symbols that
map one-to-one to letters in the English alphabet. Instead of writing their
messages using English letters, they use this custom alphabet. Alice and Bob
can exchange a copy of the mapping in advance or even memorize the
alphabet (there are only 26 symbols after all). Now when they send their
notes, only those with a copy of the alphabet map can read them.
This method is simple and effective, but it’s not absolutely secure. More
sophisticated methods might include having multiple alphabets or the same
alphabet with randomized mappings along with a key that tells the recipient
which mapping to use. This is probably overkill for passing notes, but it
becomes very important when designing encryption systems for data in
transit.
Transport Layer Security (TLS) is a cryptographic protocol for encrypting
data in transit. TLS replaced the Secure Sockets Layer (SSL), an early
standard for encrypting data in transit. TLS was first defined in RFC 2246
based on the SSL 3.0 protocol designed by Paul Kocher. Given the shared
history of TLS and SSL, the two are often used interchangeably even though
they are not the same. It is also common to use the same library for
implementing either SSL or TLS. For example, the OpenSSL library includes
implementations of SSL 2.0 and 3.0, as well as TLS 1.0, 1.1, and 1.2.
Whereas Kerberos (the subject of Chapter 4) is a protocol for enabling strong authentication, SSL/TLS
are protocols for securing data as it moves through the network. While most
commonly associated with web traffic in the form of the HTTPS protocol,
SSL/TLS are generic protocols that can be used to secure any socket
connection. This lets you create an encrypted pipe that other protocols can
then be layered on top of. In the same way that Kerberos clients rely on
trusting the KDC, clients using SSL/TLS trust a central certificate authority (CA).
The following are basic concepts that underpin SSL/TLS:
Private key
An asymmetric encryption key that is known only to the owner of a signed
certificate.
Public key
An asymmetric encryption key that is shared openly; data encrypted with
one key of the pair can only be decrypted with the other.
Certificate signing request (CSR)
A cryptographic message sent to a certificate authority to apply for a
specific identity.
Signed certificate
A certificate issued by a certificate authority that binds a public key to a
verified identity.
PKCS #12
A file format that bundles the private key and the signed certificate.
While there are many technical details of SSL/TLS that we will not cover
here, there are a few things you should understand, which are covered in the
following basic workflow examples.
Generating a new certificate
An administrator for the service seeking to accept SSL/TLS
connections generates a public and private key pair.
The administrator then generates a CSR and sends it to the CA.
The CA validates the identity of the server/service (and sometimes
business entity), and then generates a signed certificate.
The administrator of the service can then install the signed certificate.
SSL/TLS handshake
Alice connects to the Bob service, which presents an SSL/TLS
certificate to Alice.
Alice looks up the CA certificate that signed Bob’s certificate in her
chain of trusted third parties.
Alice and the Bob service exchange public keys, and then agree on a
newly created symmetric encryption key for the current session.
Alice sends messages to the Bob service that are encrypted in transit by
the securely exchanged symmetric key.
If Eve captures the packets carrying messages from Alice to Bob, she is
unable to decrypt them because she does not possess the symmetric key.
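You can observe this handshake from the client side with the openssl s_client tool, which prints the certificate chain the server presents and the cipher that was negotiated (the hostname and port here are illustrative):
$ openssl s_client -connect bob.example.com:443 -showcerts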
NOTE
RSA comes from the surname initials of Ron Rivest, Adi Shamir, and Leonard
Adleman, who described the algorithm while at MIT in 1977. As you can see
throughout this book, along with Kerberos, many of the security technologies we use
today originated at MIT.
For a more in-depth understanding of SSL/TLS, we recommend reading
Network Security with OpenSSL by John Viega, Matt Messier, and Pravir
Chandra (O'Reilly), and Chapter 14, "SSL and HTTPS," in Java Security,
Second Edition, by Scott Oaks (O'Reilly).
Hadoop Data-in-Transit Encryption
Hadoop has several methods of communication over the network, including
RPC, TCP/IP, and HTTP. API clients of MapReduce, JobTracker,
TaskTracker, NameNode and DataNodes use RPC calls. HDFS clients use
TCP/IP sockets for data transfers. The HTTP protocol is used for
MapReduce shuffles and also by many daemons for their web UIs.
Each of these three network communication channels has a different in-transit
encryption method. We will explore the basics of these next, and in
Chapter 10 we will cover a detailed example of Flume SSL/TLS
configuration. In Chapters 11 and 12, we also cover the use of SSL/TLS with
Oozie, HBase, Impala, and Hue.
Hadoop RPC Encryption
Hadoop RPC is built on the Simple Authentication and Security Layer (SASL),
which supports three quality of protection (QOP) levels:
auth, for authentication between client and server
auth-int, for authentication and integrity
auth-conf, for authentication, integrity, and confidentiality
RPC protection in Hadoop is configured with the hadoop.rpc.protection
property in the core-site.xml file. This property can be set to the following values:
authentication
The default; puts SASL into auth mode and provides only authentication
integrity
Puts SASL into auth-int mode and adds integrity checking in addition to
authentication
privacy
Puts SASL into auth-conf mode and adds encryption to ensure full
confidentiality
To configure Hadoop RPC protection, set the value in your core-site.xml as
shown here (keeping in mind that all daemons need to be restarted for it to
take effect):
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
HDFS data transfer protocol encryption
When HDFS data is transferred from one DataNode to another or between
DataNodes and their clients, a direct TCP/IP socket is used in a protocol
known as the HDFS data transfer protocol. The Hadoop RPC protocol is
used to exchange an encryption key for use in the data transfer protocol when
data transfer encryption is enabled.
To configure data transfer encryption, set dfs.encrypt.data.transfer to
true in the hdfs-site.xml file. This change is required only on the DataNodes.
RPC will be used to exchange the encryption keys, so ensure that RPC
encryption is enabled by setting the hadoop.rpc.protection configuration
to privacy, as described earlier. The encryption algorithm should also be
configured to use AES. In the following code, we configure AES encryption:
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>
NOTE
Setting AES encryption using the dfs.encrypt.data.transfer.cipher.suites
setting is a more recent Hadoop feature, added in version 2.6. For earlier releases, you can
set dfs.encrypt.data.transfer.algorithm to 3des (the default) or rc4 to choose
between triple-DES or RC4, respectively.
You will need to restart your DataNode and NameNode daemons after this is
set to take effect. The entire process can be done manually, and Hadoop
distributions might also offer automated methods to enable HDFS data
transfer encryption.
Hadoop HTTP encryption
When it comes to HTTP encryption, there is a well-known and proven
method to encrypt the data in transit using HTTPS, which is an enhancement
of HTTP with SSL/TLS. While HTTPS is very standardized, configuring it in
Hadoop is not. Several Hadoop components support HTTPS, but they are not
all configured with the same steps.
As you may recall from the description of the basic SSL/TLS concepts, a few
additional files are required, like private keys, certificates, and PKCS #12
bundles. When using Java, these files are stored in a Java keystore. Many of
the HTTPS configuration steps for Hadoop generate these objects, store them
in the Java keystore, and finally, configure Hadoop to use them.
Some Hadoop components are both HTTPS servers and clients to other
services. A few examples are:
HDFS, MapReduce, and YARN daemons act as both SSL servers and
clients
HBase daemons act as SSL servers only
Oozie daemons act as SSL servers only
Hue acts as an SSL client to all of the above
We will not cover HTTPS configuration in depth. Instead we will focus on
the MapReduce encrypted shuffle and encrypted web UI configuration as a
starting point to configure other components.
Encrypted shuffle and encrypted web UI
Encrypted shuffle is supported for both MR1 and MR2. In MR1, setting the
hadoop.ssl.enabled property in the core-site.xml file enables both the
encrypted shuffle and the encrypted web UI. In MR2, setting the
hadoop.ssl.enabled property enables the encrypted web UI feature only;
setting the mapreduce.shuffle.ssl.enabled property in the mapred-
site.xml file enables the encrypted shuffle feature.
TIP
When configuring HTTPS, just as with Kerberos, it is important to set up all your servers
with their full hostnames and to configure DNS to resolve correctly to these names across
the cluster.
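Before enabling any of the HTTPS features that follow, the Java keystore holding each daemon's private key and certificate, and a truststore holding the certificates you trust, must exist on every host. Here is a hedged sketch using keytool; the aliases, paths, and self-signed shortcut are illustrative, and in production you would instead send a CSR to your certificate authority:
# Generate a key pair; the CN should be the host's full name (see the preceding tip)
keytool -genkeypair -alias $(hostname -f) -keyalg RSA -keysize 2048 \
  -dname "CN=$(hostname -f),O=Example" -validity 365 \
  -keystore /etc/hadoop/conf/hadoop.keystore -storetype JKS

# Export the certificate and import it into a truststore
keytool -exportcert -alias $(hostname -f) \
  -keystore /etc/hadoop/conf/hadoop.keystore -file $(hostname -f).crt
keytool -importcert -alias $(hostname -f) -file $(hostname -f).crt \
  -keystore /etc/hadoop/conf/hadoop.truststore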
For both MR1 and MR2, set the hadoop.ssl.enabled property in core-site.xml; this will enable
the encrypted web UI. For MR1, this also enables the encrypted shuffle:
<property>
  <name>hadoop.ssl.enabled</name>
  <value>true</value>
</property>
For MR2 only, set the encrypted shuffle SSL property in mapred-site.xml:
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
You can also optionally set the hadoop.ssl.hostname.verifier property
to control how hostname verification happens. Valid values are:
DEFAULT
The hostname must match either the first CN or any of the subject-alt
names. If a wildcard exists on either the CN or one of the subject-alt
names, then it matches all subdomains.
DEFAULT_AND_LOCALHOST
This behaves the same as DEFAULT, with the addition that a host of
localhost, localhost.localdomain, 127.0.0.1, and ::1 will
always pass.
STRICT
This behaves like DEFAULT, but only matches wildcards on the same
level. For example, a wildcard for *.example.com matches
abc.example.com but not xyz.abc.example.com.
ALLOW_ALL
Accepts any hostname. This mode should only be used in testing because
it is not secure.
For example, to support default plus localhost mode, set the following:
<property>
  <name>hadoop.ssl.hostname.verifier</name>
  <value>DEFAULT_AND_LOCALHOST</value>
</property>
You will also need to update your ssl-server.xml and ssl-client.xml files.
These files are typically located in the /etc/hadoop/conf directory. The
settings that go into the ssl-server.xml file are shown in Table 9-2.
Table 9-2. Keystore and truststore settings for ssl-server.xml
Property Default value Description
ssl.server.keystore.type jks The keystore file type
ssl.server.keystore.location NONE The path to the keystore file; this file should be owned by the mapred user and the mapred user must have exclusive read access to it (i.e., permission 400)
ssl.server.keystore.password NONE The password to the keystore file
ssl.server.truststore.type jks The truststore file type
ssl.server.truststore.location NONE The path to the truststore file; this file should be owned by the mapred user and the mapred user must have exclusive read access to it (i.e., permission 400)
ssl.server.truststore.password NONE The password to the truststore file
ssl.server.truststore.reload.interval 10000 Number of milliseconds between reloading the truststore file
An example, fully configured ssl-server.xml file looks like this (the keystore and
truststore paths and passwords shown are placeholders):
<configuration>
  <property>
    <name>ssl.server.keystore.type</name>
    <value>jks</value>
  </property>
  <property>
    <name>ssl.server.keystore.location</name>
    <value>/etc/hadoop/conf/hadoop.keystore</value>
  </property>
  <property>
    <name>ssl.server.keystore.password</name>
    <value>password</value>
  </property>
  <property>
    <name>ssl.server.truststore.type</name>
    <value>jks</value>
  </property>
  <property>
    <name>ssl.server.truststore.location</name>
    <value>/etc/hadoop/conf/hadoop.truststore</value>
  </property>
  <property>
    <name>ssl.server.truststore.password</name>
    <value>password</value>
  </property>
</configuration>
The settings that go into the ssl-client.xml file are shown in Table 9-3.
Table 9-3. Keystore and truststore settings for ssl-client.xml
Property Default value Description
ssl.client.keystore.type jks The keystore file type.
ssl.client.keystore.location NONE The path to the keystore file; this file should be owned by the mapred user and all users that can run a MapReduce job should have read access (i.e., permission 444).
ssl.client.keystore.password NONE The password to the keystore file.
ssl.client.truststore.type jks The truststore file type.
ssl.client.truststore.location NONE The path to the truststore file; this file should be owned by the mapred user and all users that can run a MapReduce job should have read access (i.e., permission 444).
ssl.client.truststore.password NONE The password to the truststore file.
ssl.client.truststore.reload.interval 10000 Number of milliseconds between reloading the truststore file.
An example, fully configured ssl-client.xml file looks like this (again, paths and
passwords are placeholders):
<configuration>
  <property>
    <name>ssl.client.keystore.type</name>
    <value>jks</value>
  </property>
  <property>
    <name>ssl.client.keystore.location</name>
    <value>/etc/hadoop/conf/hadoop-client.keystore</value>
  </property>
  <property>
    <name>ssl.client.keystore.password</name>
    <value>password</value>
  </property>
  <property>
    <name>ssl.client.truststore.type</name>
    <value>jks</value>
  </property>
  <property>
    <name>ssl.client.truststore.location</name>
    <value>/etc/hadoop/conf/hadoop-client.truststore</value>
  </property>
  <property>
    <name>ssl.client.truststore.password</name>
    <value>password</value>
  </property>
</configuration>
WARNING
Enabling HTTPS does not by itself stop daemons from serving unencrypted HTTP. To keep clients
from bypassing encryption, configure a firewall, such as the iptables software firewall, to disable access to port 80.
After you set up your ssl-server.xml and ssl-client.xml files, you need to
restart all the TaskTrackers in MR1 and NodeManagers in MR2 for the
changes to take effect.
Data Destruction and Deletion
When dealing with data security, how you delete the data is important. If you
happen to reuse servers in your cluster that may have previously been used
with sensitive data, you will want to destroy the data first—for instance, in
Example 9-2, we used dd to zero out the LUKS partition.
You can do a more thorough destruction of data using the GNU shred utility.
The shred utility will overwrite a file or device with random patterns to
better obfuscate the previous data that was written. You can pass shred a
number of iterations to run, with three passes being the default. The old DoD
5220.22-M standard mandated that a 7-pass overwrite was required to
securely erase sensitive data. The most secure mode implements the Gutmann
method, which requires 35 passes using a combination of random data and
specially selected data patterns.
Performing 35 overwrite passes on a large disk is a time-consuming
operation. Assuming your disk can write at a sustained 100 MB/s, it will take
over 200 hours, or roughly 8.5 days, to fully overwrite a 2 TB drive 35
times. When dealing with a cluster of hundreds of machines and thousands of
disks, this is a huge undertaking even if you perform the sanitization in
parallel.
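As a sketch, shred can be pointed directly at a decommissioned drive (the device name here is illustrative); -n sets the number of passes, -v reports progress, and -z adds a final pass of zeros to hide that shredding took place:
[root@hadoop01 ~]# shred -v -n 7 -z /dev/sdb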
Summary
In this chapter, we discussed how encryption is used to protect data from
unauthorized access by users and administrators of a Hadoop cluster. We
compared and contrasted protecting data at rest and data in transit. We
described how HDFS has recently added native data-at-rest encryption along
with alternative approaches applicable to earlier versions, as well as to data
that lives outside of HDFS. We also showed how intermediate data that is
generated during a data processing job or query can also be encrypted to
provide end-to-end protection.
Next, we discussed methods of protecting data in transit starting with
Hadoop RPC encryption. We followed this with protection of data from
HDFS clients to DataNodes and between DataNodes in the form of HDFS
data transfer protocol encryption. We also discussed how to encrypt the
HTTP endpoints and the MapReduce shuffle with SSL/TLS. Lastly, we
described extending the protection of data to the operational end of life for
hardware by describing methods of permanent data destruction.
The next two chapters will explore holistically securing your Hadoop
environment by extending data security to your data ingest pipeline and client
access, respectively.
Chapter 10. Securing Data
Ingest
There are many ways for data to be ingested into Hadoop. The simplest
method is to copy files from a local filesystem (e.g., a local hard disk or an
NFS mount) to HDFS using Hadoop’s put command, as shown in
Example 10-1.
Example 10-1. Ingesting files from the command line
[alice@hadoop01 ~]$ hdfs dfs -put /mnt/data/sea*.json
/data/raw/sea_fire_911/
While this method might work for some datasets, it’s much more common to
ingest data from existing relational systems or set up flows of event- or log-
oriented data. For these use cases, Sqoop and Flume are the tools of choice, respectively.
Sqoop is designed to either pull data from a relational database into Hadoop
or to push data from Hadoop into a remote database. In both cases, Sqoop
launches a MapReduce job that does the actual data transfer. By default,
Sqoop uses JDBC drivers to transport data between the map tasks and the
database. This is called generic mode and it makes it easy to use Sqoop with
new data stores, as the only requirement is the availability of JDBC drivers.
For performance reasons, Sqoop also supports connectors that can use
vendor-specific tools and interfaces to optimize the data transfer. To enable
these optimizations, users specify the --direct option to enable direct mode. For example, when enabling direct mode for MySQL, Sqoop will use
the mysqldump and mysqlimport utilities to extract from or import to
MySQL much more efficiently.
Flume includes an AvroSource and an AvroSink that use Avro RPC to
transfer events. You can configure the AvroSink of one Flume agent to send
events to the AvroSource of another Flume agent in order to build complex,
distributed data flows. While Flume also supports a wide variety of sources
and sinks, the primary ones used to implement inter-agent data flow are the
AvroSource and AvroSink, so we'll restrict the rest of our discussion to this
pair. The reliability of Flume is determined by the configuration of the
channel. There are in-memory channels for data flows that prioritize speed
over reliability, as well as disk-backed channels that support full
recoverability. Figure 10-1 shows a two-tier Flume data flow showing the
components internal to the agents as well as their interconnection.
Figure 10-1. Flume architecture
Because Sqoop and Flume can be used to transfer sensitive data, it is
important to consider the security implications of your ingest pipeline in the
context of the overall deployment. In particular, you need to worry about the
confidentiality, integrity, and availability (CIA) of your ingest pipeline.
Confidentiality refers to limiting access to the data to a set of authorized
users. Systems typically guarantee confidentiality by a combination of
authentication, authorization, and encryption. Integrity refers to how much
you can trust that data hasn’t been tampered with. Most systems employ
checksums or signatures to verify the integrity of data. Availability refers to
keeping information resources available. In the context of data ingest, it
means that your ingestion system is robust against the loss of some capacity
and that it has the ability to preserve in-transit data while it’s dealing with
the outage of some downstream system or service.
Hadoop clusters are complex distributed systems, and data ingest flows are often
equally complex and distributed. That means there are multiple places where data
can be accessed, corrupted, or tampered
ingest pipeline requires. Most use cases are concerned with accidentally
corrupted data, and for those a simple checksum of records or files is
sufficient. To prevent tampering by malicious users, you can add
cryptographic signatures and/or encryption of records.
One of the primary ways that Flume guarantees integrity is through its built-
in, reliable channel implementations. Flume channels present a very simple
interface that resembles an unbounded queue. Channels have a put(Event
event) method for putting an event into a channel and a take() method for
taking the next event from the channel. The default channel implementation is
an in-memory channel. This implementation is reliable but only so long as the
Flume agent stays up. This means that in the event of a process or server
crash, data will be lost. Furthermore, because the events never leave
memory, Flume assumes that events can’t be tampered with and does not
calculate or verify event checksums.
For users that care about reliable delivery of events and integrity, Flume
offers a file-based channel. The file channel essentially implements a write-
ahead log that is persisted to stable storage as each event is put into the
channel. In addition to persisting the events to disk, the file channel
calculates a checksum of each event and writes the checksum to the write-
ahead log along with the event. When events are taken from the channel, the
checksum is verified to ensure that the event has not been corrupted. This
provides some integrity guarantees but is limited to the integrity of the event
as it passes through the channel. Currently, Flume does not calculate
checksums when passing events from one agent to another from the AvroSink
to the AvroSource. TCP will still protect against accidental corruption of
packets, but a man-in-the-middle who is able to manipulate packets could
still corrupt data in a manner that is not detected. In the next section, we’ll
see that Flume does have the ability to encrypt the RPC protocol, which
would prevent undetected corruption by a man-in-the-middle attack.
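To make the file channel concrete, a minimal configuration looks like the following sketch (agent name and paths are illustrative); the checkpoint and data directories are where the write-ahead log and its per-event checksums are persisted:
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/01/flume/checkpoint
a1.channels.c1.dataDirs = /data/02/flume/data,/data/03/flume/data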
Before moving on to the integrity offered by Sqoop, let’s quickly cover how
Flume approaches availability. Flume lets users build a distributed data flow
that guarantees at-least-once delivery semantics. Strictly speaking, Flume is
available from the point-of-view of a particular data source as long as the
first agent that communicates with the external source is available. As events
proceed from agent to agent, any downtime of a downstream agent can be
handled by using a failover sink processor that targets two or more
downstream Flume agents as the target. You can also have both failure
handling and load balancing by using the load balancing sink processor. This
processor will send events to a set of downstream sinks in either a round
robin or random fashion. If the downstream sink fails, it will retry with the
next sink.
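A sketch of both sink processors follows (the agent, sink, and group names are illustrative). The failover processor always prefers the highest-priority healthy sink, while the load_balance processor spreads events across its sinks:
# Failover: s2 takes over if s1 fails
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = s1 s2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.s1 = 10
a1.sinkgroups.g1.processor.priority.s2 = 5

# Load balancing alternative: round_robin or random selection
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin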
Both of these mechanisms improve availability of the overall data flow, but
they don’t guarantee availability of any particular event. There is a proposal
to add a replicating channel that would replicate events to multiple agents
before acknowledging the source, but until that is in place, events will be
unavailable while nodes in the Flume cluster are down. No events will be
lost unless the stable storage where the file channel logs data is
unrecoverable. When building an ingest flow, it’s important to keep these
considerations in mind. We’ll go into more detail on these kinds of trade-offs
in “Enterprise Architecture”.
Verifying the integrity of data moved by Sqoop is harder: comparing what landed in
Hadoop against the source database is an expensive process and may require multiple
full table scans depending on your database's checksum capabilities.
To solve the problem of unauthorized access to data as it transits the
network, Flume supports enabling SSL encryption on your AvroSource and
network, Flume supports enabling SSL encryption on your AvroSource and
AvroSink. In addition to providing encryption, you can configure the
AvroSource and AvroSink with trust policies to ensure that a sink is only
sending data to a trusted source. Let’s suppose we want to send events from a
Flume agent running on flume01.example.com to a second agent running on
flume02.example.com. The first thing we have to do is create an RSA
private key for flume02 using the openssl command-line tool, as shown in
Example 10-2.
Example 10-2. Creating a private key
[alice@flume02 ~]$ mkdir certs
[alice@flume02 ~]$ cd certs
[alice@flume02 certs]$ openssl genrsa -des3 -out flume02.key 1024
Generating RSA private key, 1024 bit long modulus
............................................................................
...
........++++++
.....................++++++
e is 65537 (0x10001)
Enter pass phrase for flume02.key:
Verifying - Enter pass phrase for flume02.key:
[alice@flume02 certs]$
In Example 10-3, we generate a certificate signing request so that a
certificate can be issued to the private key we just created.
Example 10-3. Creating a certificate signing request
[alice@flume02 certs]$ openssl req -new -key flume02.key -out flume02.csr
Enter pass phrase for flume02.key:
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
---
Country Name (2 letter code) [XX]:US
State or Province Name (full name) []:California
Locality Name (eg, city) [Default City]:San Francisco
Organization Name (eg, company) [Default Company Ltd]:Cluster, Inc.
Organizational Unit Name (eg, section) []:
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
[alice@flume02 certs]$
Once we have the certificate signing request, we can generate a certificate
signed by a trusted key. In our example, we don’t have a root signing
authority, so we’ll just create a self-signed certificate (a certificate signed by
the same key that requested the certificate). In a real deployment, you’d send
the certificate signing request to your corporate signing authority and they
would provide the signed certificate. A self-signed certificate will work just
fine for Example 10-4.
Example 10-4. Creating a self-signed certificate
[alice@flume02 certs]$ openssl x509 -req -days 365 -in flume02.csr \
-signkey flume02.key -out flume02.crt
Signature ok
Getting Private key
Enter pass phrase for flume02.key:
[alice@flume02 certs]$
Next, in Example 10-5, we create a Java truststore and import the newly signed
certificate so that other Flume agents can trust it.
Example 10-5. Creating a Java truststore
[alice@flume02 certs]$ keytool -import -alias flume02.example.com \
-file flume02.crt -keystore flume.truststore
Enter keystore password:
Re-enter new password:
Owner: EMAILADDRESS=admin@example.com, CN=flume02.example.com, O="Cluster,
Inc.
", L=San Francisco, ST=California, C=US
Issuer: EMAILADDRESS=admin@example.com, CN=flume02.example.com, O="Cluster,
Inc
.", L=San Francisco, ST=California, C=US
Serial number: 86a6cb314f86328b
Valid from: Tue Jun 24 11:31:50 PDT 2014 until: Wed Jun 24 11:31:50 PDT 2015
Certificate fingerprints:
MD5: B6:4A:A7:98:9B:60:3F:A2:5E:0B:BA:BA:12:B4:8D:68
SHA1: AB:F4:AB:B3:2D:E1:AF:71:28:8B:60:54:2D:C1:C9:A8:73:18:92:31
SHA256:
B1:DD:C9:1D:AD:57:FF:47:28:D9:7F:A8:A3:DF:9C:BE:30:C1:49:CD:85
:D3:95:AD:95:36:DC:40:4C:72:15:AB
Signature algorithm name: SHA1withRSA
Version: 1
Trust this certificate? [no]: yes
Certificate was added to keystore
[alice@flume02 certs]$
Before we can use our certificate and key with Flume, we need to load them
into a file format that Java can read. Generally, this will be either a Java
keystore .jks file or a PKCS12 .p12 file. Because Java's keytool doesn't have
support for importing a separate key and certificate, we’ll use openssl to
generate a PKCS12 file and configure Flume to use that directly, as shown in
Example 10-6.
Example 10-6. Creating a PKCS12 file with our key and certificate
[alice@flume02 certs]$ openssl pkcs12 -export -in flume02.crt \
-inkey flume02.key -out flume02.p12 -name flume02.example.com
Enter pass phrase for flume02.key:
Enter Export Password:
Verifying - Enter Export Password:
[alice@flume02 certs]$
Prior to configuring Flume to use our certificate, we need to move the file into Flume's configuration directory, as shown in Example 10-7.
Example 10-7. Moving PKCS12 file to /etc/flume-ng/ssl
[root@flume02 ~]# mkdir /etc/flume-ng/ssl
[root@flume02 ~]# cp ~alice/certs/flume02.p12 /etc/flume-ng/ssl
[root@flume02 ~]# chown -R root:flume /etc/flume-ng/ssl
[root@flume02 ~]# chmod 750 /etc/flume-ng/ssl
[root@flume02 ~]# chmod 640 /etc/flume-ng/ssl/flume02.p12
In Example 10-8, you’ll see that we also need to copy the truststore to
flume01.example.com so that the sink will know it can trust the source on
flume02.example.com.
Example 10-8. SCP truststore to flume01.example.com
[root@flume02 ~]# scp ~alice/certs/flume.truststore
flume01.example.com:/tmp/
Next, in Example 10-9, we move the truststore into Flume’s configuration
directory.
Example 10-9. Moving truststore to /etc/flume-ng/ssl
[root@flume01 ~]# mkdir /etc/flume-ng/ssl
[root@flume01 ~]# mv /tmp/flume.truststore /etc/flume-ng/ssl
[root@flume01 ~]# chown -R root:flume /etc/flume-ng/ssl
[root@flume01 ~]# chmod 750 /etc/flume-ng/ssl
[root@flume01 ~]# chmod 640 /etc/flume-ng/ssl/flume.truststore
Now that the PKCS12 and truststore files are in place, we can configure
Flume’s source and sink. We’ll start with the sink on
flume01.example.com. The key configuration parameters are as follows:
ssl
Set to true to enable SSL for this sink. When SSL is enabled, you also
need to configure the trust-all-certs, truststore, truststore-
password, and truststore-type parameters.
trust-all-certs
Set to true to disable certificate verification. It’s highly recommended
that you set this parameter to false, as that will ensure that the sink
checks that the source it connects to is using a trusted certificate.
truststore
Set this to the full path of the Java truststore file. If left blank, Flume will
use the default Java certificate authority files. The Oracle JRE ships with
a file called $JAVA_HOME/jre/lib/security/cacerts, which will be used
unless a site-specific truststore is created in
$JAVA_HOME/jre/lib/security/jssecacerts.
truststore-password
Set this to the password that protects the truststore.
truststore-type
Set this to JKS or another supported truststore type.
Example 10-10 shows an example configuration.
Example 10-10. Avro SSL sink configuration
a1.sinks = s1
a1.channels = c1
a1.sinks.s1.type = avro
a1.sinks.s1.channels = c1
a1.sinks.s1.hostname = flume02.example.com
a1.sinks.s1.port = 4141
a1.sinks.s1.ssl = true
a1.sinks.s1.trust-all-certs = false
a1.sinks.s1.truststore = /etc/flume-ng/ssl/flume.truststore
a1.sinks.s1.truststore-password = password
a1.sinks.s1.truststore-type = JKS
On flume02.example.com, we can configure the AvroSource to use our
certificate and private key to listen for connections. The key configuration
parameters are as follows:
ssl
Set to true to enable SSL for this source. When SSL is enabled, you also
need to configure the keystore, keystore-password, and keystore-
type parameters.
keystore
Set this to the full path of the Java keystore.
keystore-password
Set this to the password that protects the keystore.
keystore-type
Set this to JKS or PKCS12.
Example 10-11 shows an example source configuration.
Example 10-11. Avro SSL source configuration
a2.sources = r1
a2.channels = c1
a2.sources.r1.type = avro
a2.sources.r1.channels = c1
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
a2.sources.r1.ssl = true
a2.sources.r1.keystore = /etc/flume-ng/ssl/flume02.p12
a2.sources.r1.keystore-password = password
a2.sources.r1.keystore-type = PKCS12
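After restarting both agents, one quick way to verify that the source is really answering with SSL is to point openssl s_client at the listening port; it should print the flume02 certificate created earlier:
[alice@flume01 ~]$ openssl s_client -connect flume02.example.com:4141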
In addition to protecting your data over the wire, you might need to ensure
that data is encrypted on the drives where Flume writes events as they transit
a channel. One option is to use a third-party encryption tool that supports full
disk encryption on the drives your Flume channel writes to. This could also
be done with dm-crypt/LUKS.1 However, full disk encryption might be
overkill, especially if Flume is not the only service using the log drives or if
not all events need to be encrypted.
For those use cases, Flume offers the ability to encrypt the logfiles used by
the file channel. The current implementation only supports AES encryption in
Counter mode with no padding (AES/CTR/NOPADDING), but it is possible
to add additional algorithms and modes in the future. Flume currently only
supports the JCE keystore implementation (JCEKS) as the key provider.
Again, nothing precludes adding support for additional key providers but it
would require a modification to Flume itself, as there is not currently a
pluggable interface for adding key providers. Despite these limitations,
Flume does support key rotation to help improve security. Because the file
channel logfiles are relatively short lived, you can rotate keys as frequently
as necessary to meet your requirements. In order to ensure that logfiles
written with the previous key are still readable, you must maintain old keys
for reading while only the newest key is used for writing.
To set up Flume’s on-disk encryption for the file channel, start by generating
a key, as shown in Example 10-12.
Example 10-12. Generating the key for on-disk encrypted file channel
[root@flume01 ~]# mkdir keys
[root@flume01 ~]# cd keys/
[root@flume01 keys]# keytool -genseckey -alias key-0 -keyalg AES -keysize
256 \
-validity 9000 -keystore flume.keystore -storetype jceks
Enter keystore password:
Re-enter new password:
Enter key password for <key-0>
(RETURN if same as keystore password):
Re-enter new password:
[root@flume01 keys]#
In our example, we set the keystore password to keyStorePassword and the
key password to keyPassword. In a real deployment, stronger passwords
should be used. Keytool won’t show what you are typing nor will it show the
familiar asterisk characters, so type carefully. You can also provide the
keystore password and key password on the command line with -storepass
keyStorePassword and -keypass keyPassword, respectively. It’s
generally not recommended to include passwords on the command line, as
they will typically get written to your shell’s history file, which should not be
considered secure. Next, let’s copy the keystore to Flume’s configuration
directory in Example 10-13.
Example 10-13. Copying the keystore to Flume’s configuration directory
[root@flume01 ~]# mkdir /etc/flume-ng/encryption
[root@flume01 ~]# cp ~/keys/flume.keystore /etc/flume-ng/encryption/
[root@flume01 ~]# cat > /etc/flume-ng/encryption/keystore.password
keyStorePassword
^D
[root@flume01 ~]# cat > /etc/flume-ng/encryption/key-0.password
keyPassword
^D
[root@flume01 ~]# chown -R root:flume /etc/flume-ng/encryption
[root@flume01 ~]# chmod 750 /etc/flume-ng/encryption
[root@flume01 ~]# chmod 640 /etc/flume-ng/encryption/*
[root@flume01 ~]#
Notice that we also created files that contain the keystore password and key
passwords. Where the listing shows ^D, you should hold down the Control
key and type the letter D on the keyboard. Generating these files helps to
keep the passwords protected, as they won’t be accessible to the same users
that can read Flume’s configuration file. Now we can configure Flume to
enable encryption on the file channel, as shown in Example 10-14.
Example 10-14. Encrypted file channel configuration
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/01/flume/checkpoint
a1.channels.c1.dataDirs = /data/02/flume/data,/data/03/flume/data
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.activeKey = key-0
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile =
/etc/flume-ng/encryption/flume.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile =
/etc/flume-ng/encryption/keystore.password
a1.channels.c1.encryption.keys = key-0
a1.channels.c1.encryption.keys.key-0.passwordFile =
/etc/flume-ng/encryption/key-0.password
WARNING
Examples 10-14 through 10-16 show some configuration settings
(a1.channels.c1.encryption.keyProvider.keyStorePasswordFile,
a1.channels.c1.encryption.keys.key-0.passwordFile) split across two lines.
These are meant to improve readability of the examples, but are not valid for a Flume
configuration file. All setting names and values must be on the same line.
Over time, it might become necessary to rotate in a new encryption key to
mitigate the risk of an older key becoming compromised. Flume supports
configuring multiple keys for decryption while only using the latest key for
encryption. The old keys must be maintained to ensure that old logfiles that
were written before the rotation can still be read. We can extend our example
in Example 10-15 by generating a new key and updating Flume to make it the
active key.
Example 10-15. Generating a new key for on-disk encrypted file channel
[root@flume01 ~]# keytool -genseckey -alias key-1 -keyalg AES -keysize 256 \
-validity 9000 -keystore /etc/flume-ng/encryption/flume.keystore \
-storetype jceks
Enter keystore password:
Enter key password for <key-1>
(RETURN if same as keystore password):
Re-enter new password:
[root@flume01 ~]# cat > /etc/flume-ng/encryption/key-1.password
key1Password
^D
[root@flume01 ~]# chmod 640 /etc/flume-ng/encryption/*
[root@flume01 ~]#
Now that we’ve added our new key to the keystore and created the associate
key password file, we can update Flume’s configuration to make the new key
the active key, as shown in Example 10-16.
Example 10-16. Encrypted file channel new key configuration
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/01/flume/checkpoint
a1.channels.c1.dataDirs = /data/02/flume/data,/data/03/flume/data
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.activeKey = key-1
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile =
/etc/flume-ng/encryption/flume.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile =
/etc/flume-ng/encryption/keystore.password
a1.channels.c1.encryption.keys = key-0 key-1
a1.channels.c1.encryption.keys.key-0.passwordFile =
/etc/flume-ng/encryption/key-0.password
a1.channels.c1.encryption.keys.key-1.passwordFile =
/etc/flume-ng/encryption/key-1.password
Here is a summary of the parameters for configuring file channel encryption:
encryption.activeKey
The alias for the key used to encrypt new data.
encryption.cipherProvider
The type of the cipher provider. Supported providers:
AESCTRNOPADDING
encryption.keyProvider
The type of the key provider. Supported providers: JCEKSFILE
encryption.keyProvider.keyStoreFile
The path to the keystore file.
encryption.keyProvider.keyStorePasswordFile
The path to a file that contains the password for the keystore.
encryption.keyProvider.keys
A space-delimited list of key aliases that are or have been the active key.
encryption.keyProvider.keys.<key>.passwordFile
An optional path to a file that contains the password for the key <key>. If
omitted, the password from the keystore password file is used for all keys.
Sqoop can also protect data in transit. In generic mode, Sqoop transfers data over
JDBC, so if the database's JDBC driver supports SSL, the connection can be
encrypted. This support is not necessarily limited to the generic JDBC implementation:
if the tool that is used to implement --direct mode supports SSL, you can
still encrypt data even when using the direct connector.
Let’s take a look at how we can use SSL to encrypt traffic between Sqoop
and MySQL.2 Examples 10-17 and 10-18 assume that SSL is already
configured for MySQL.3 If you have not already done so, download the
MySQL JDBC drivers from MySQL’s connector download page. After you
download the connector, install it in a location to make it available to Sqoop,
as shown in Example 10-17.
Example 10-17. Installing the MySQL JDBC driver for Sqoop
[root@sqoop01 ~]# SQOOP_HOME=/usr/lib/sqoop
[root@sqoop01 ~]# tar -zxf mysql-connector-java-*.tar.gz
[root@sqoop01 ~]# cp mysql-connector-java-*/mysql-connector-java-*-bin.jar \
${SQOOP_HOME}/lib
[root@sqoop01 ~]#
When the driver is in place, you can test the connection by using Sqoop’s
list-tables command, as shown in Example 10-18.
Example 10-18. Testing SSL connection by listing tables
[alice@sqoop01 ~]$ URI="jdbc:mysql://mysql01.example.com/sqoop"
[alice@sqoop01 ~]$ URI="${URI}?verifyServerCertificate=false"
[alice@sqoop01 ~]$ URI="${URI}&useSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&requireSSL=true"
[alice@sqoop01 ~]$ sqoop list-tables --connect ${URI} \
--username sqoop -P
Enter password:
cities
countries
normcities
staging_cities
visits
[alice@sqoop01 ~]$
The parameters that tell the MySQL JDBC driver to use SSL encryption are
provided as options to the JDBC URI passed to Sqoop:
verifyServerCertificate
Controls whether the client should validate the MySQL server’s
certificate. If set to true, you also need to set
trustCertificateKeyStoreUrl, trustCertificateKeyStoreType,
and trustCertificateKeyStorePassword.
useSSL
When set to true, the client will attempt to use SSL when talking to the
server.
requireSSL
When set to true, the client will reject connections if the server doesn’t
support SSL.
Now let’s try importing a table over SSL in Example 10-19.
Example 10-19. Importing a MySQL table over SSL
[alice@sqoop01 ~]$ URI="jdbc:mysql://mysql01.example.com/sqoop"
[alice@sqoop01 ~]$ URI="${URI}?verifyServerCertificate=false"
[alice@sqoop01 ~]$ URI="${URI}&useSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&requireSSL=true"
[alice@sqoop01 ~]$ sqoop import --connect ${URI} \
--username sqoop -P --table cities
Enter password:
...
14/06/27 16:09:07 INFO mapreduce.ImportJobBase: Retrieved 3 records.
[alice@sqoop01 ~]$ hdfs dfs -cat cities/part-m-*
1,USA,Palo Alto
2,Czech Republic,Brno
3,USA,Sunnyvale
[alice@sqoop01 ~]$
You can see that it is as simple as once again including the SSL parameters in
the JDBC URI. We can confirm that the SSL parameters are used while the
job executes by looking at the configuration of the job in the Job History
Server’s page, as shown in Figure 10-2.
Figure 10-2. Job History Server page showing use of SSL JDBC settings
In the previous example, we set verifyServerCertificate to false.
While this is useful for testing, in a production setting we’d much rather
verify that the server we’re connecting to is in fact the server we expect it to
be. Let’s see what happens if we attempt to set that parameter to true in
Example 10-20.
Example 10-20. Certificate verification fails without a truststore
[alice@sqoop01 ~]$ URI="jdbc:mysql://mysql01.example.com/sqoop"
[alice@sqoop01 ~]$ URI="${URI}?verifyServerCertificate=true"
[alice@sqoop01 ~]$ URI="${URI}&useSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&requireSSL=true"
[alice@sqoop01 ~]$ sqoop list-tables --connect ${URI} \
--username sqoop -P
Enter password:
14/06/30 10:52:29 ERROR manager.CatalogQueryManager: Failed to list tables
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link
failure
The last packet successfully received from the server was 1,469 milliseconds
ago. Th
e last packet sent successfully to the server was 1,464 milliseconds ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
...
at org.apache.sqoop.Sqoop.main(Sqoop.java:240)
Caused by: javax.net.ssl.SSLHandshakeException:
sun.security.validator.ValidatorExcep
tion: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderExc
eption: unable to find valid certification path to requested target
...
[alice@sqoop01 ~]$
Unsurprisingly, this didn’t work, as Java’s standard certificate truststores
don’t include our MySQL server’s certificate as a trusted certificate. The key
error message to look for when diagnosing these kinds of trust issues is
unable to find valid certification path to requested target.
That basically means that there is no signing path from any of our trusted
certificates to the server’s certificate. The easiest way to remedy this is to
import the MySQL server’s certificate into a truststore and instruct the
MySQL JDBC driver to use that truststore when connecting, as shown in
Example 10-21.
Example 10-21. Listing tables with a local truststore
[alice@sqoop01 ~]$ keytool \
-import \
-alias mysql.example.com \
-file mysql.example.com.crt \
-keystore sqoop-jdbc.ts
Enter keystore password:
Re-enter new password:
Owner: EMAILADDRESS=admin@example.com, CN=mysql.example.com, O="Cluster,
Inc.",
L=San Francisco, ST=California, C=US
Issuer: EMAILADDRESS=admin@example.com, CN=mysql.example.com, O="Cluster,
Inc."
, L=San Francisco, ST=California, C=US
Serial number: d7f528349bee94f3
Valid from: Fri Jun 27 13:59:05 PDT 2014 until: Sat Jun 27 13:59:05 PDT 2015
Certificate fingerprints:
MD5: 38:9E:F4:D0:4C:14:A8:DF:06:EC:A5:59:76:D1:0C:21
SHA1: AD:D0:CB:E2:70:C1:89:83:22:32:DE:EF:E5:2B:E5:4F:7E:49:9E:0A
Signature algorithm name: SHA1withRSA
Version: 1
Trust this certificate? [no]: yes
Certificate was added to keystore
[alice@sqoop01 ~]$ URI="jdbc:mysql://mysql01.example.com/sqoop"
[alice@sqoop01 ~]$ URI="${URI}?verifyServerCertificate=true"
[alice@sqoop01 ~]$ URI="${URI}&useSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&requireSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStoreUrl=file:sqoop-
jdbc.ts
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStoreType=JKS"
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStorePassword=password"
[alice@sqoop01 ~]$ sqoop list-tables --connect ${URI} \
--username sqoop -P
Enter password:
cities
countries
normcities
staging_cities
visits
[alice@sqoop01 ~]$
Here, we first create a truststore with the MySQL server’s certificate and
then we point the MySQL JDBC driver to the truststore. This requires us to
set some additional parameters in the JDBC URI, namely:
trustCertificateKeyStoreUrl
A URL pointing to the location of the keystore used to verify the MySQL
server’s certificate.
trustCertificateKeyStoreType
The type of the keystore used to verify the MySQL server’s certificate.
trustCertificateKeyStorePassword
The password of the keystore used to verify the MySQL server’s
certificate.
Notice that we specify the location of the truststore with a relative file:
URL. This will become important in the next example. Now that we can list
tables, let's try doing an import in Example 10-22.
Example 10-22. Importing tables with a truststore
[alice@sqoop01 ~]$ URI="jdbc:mysql://mysql01.example.com/sqoop"
[alice@sqoop01 ~]$ URI="${URI}?verifyServerCertificate=true"
[alice@sqoop01 ~]$ URI="${URI}&useSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&requireSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStoreUrl=file:sqoop-
jdbc.ts"
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStoreType=JKS"
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStorePassword=password"
[alice@sqoop01 ~]$ sqoop import \
-files sqoop-jdbc.ts \
--connect ${URI} \
--username sqoop \
-P \
--table cities
Enter password:
...
14/06/30 10:57:13 INFO mapreduce.ImportJobBase: Retrieved 3 records.
[alice@sqoop01 ~]$
Not much has changed, except we’re using the same URI as the last list-
tables example and we’ve added a -files command-line argument. The -
files switch will place the list of files into Hadoop’s distributed cache. The
distributed cache copies the files to each node in the cluster and places them
in the working directory of the running task. This is useful, as it means our
specification for the trustCertificateKeyStoreUrl works for both the
local machine and all of the nodes where tasks are executed. That is why we
wanted the truststore to be in the working directory where we launched the
Sqoop job.
Encryption support is not limited to generic mode. In particular, MySQL’s
direct mode uses the mysqldump and mysqlimport tools, which support
SSL. Let’s see how we’d enable SSL in direct mode in Example 10-23.
Example 10-23. Importing tables with a truststore using direct mode
[alice@sqoop01 ~]$ URI="jdbc:mysql://mysql01.example.com/sqoop"
[alice@sqoop01 ~]$ URI="${URI}?verifyServerCertificate=true"
[alice@sqoop01 ~]$ URI="${URI}&useSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&requireSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStoreUrl=file:sqoop-
jdbc.ts"
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStoreType=JKS"
[alice@sqoop01 ~]$ URI="${URI}&trustCertificateKeyStorePassword=password"
[alice@sqoop01 ~]$ sqoop import \
-files sqoop-jdbc.ts,mysql.example.com.crt \
--connect ${URI} \
--username sqoop \
-P \
--table cities \
--direct \
-- \
--ssl \
--ssl-ca=mysql.example.com.crt \
--ssl-verify-server-cert
Enter password:
...
14/06/30 15:32:43 INFO mapreduce.ImportJobBase: Retrieved 3 records.
[alice@sqoop01 ~]$
Again, this is very similar to our last example. The main differences are that
we added mysql.example.com.crt to the -files switch so that the nodes
will have the PEM-formatted certificate file that will be required by the
mysqldump tool. We also added the --direct switch to enable direct mode.
Finally, we added a -- switch followed by --ssl, --ssl-
ca=mysql.example.com, and --ssl-verify-server-cert. The -- switch
indicates that all following arguments should be passed to the tool that
implements direct mode. The rest of the arguments will be processed by
mysqldump to enable SSL, set the location of the CA certificate, and to tell
mysqldump to verify the MySQL server’s certificate.
One detail that we glossed over is where tools like Sqoop are launched from.
Typically, you want to limit the interfaces that users have access to. As
described in “Remote Access Controls”, there are numerous ways that
access to remote protocols can be secured, and the exact architecture will
depend on your needs. The most common way of limiting access to edge
services, including ingest, is to deploy services like Flume and Sqoop to
edge nodes. Edge nodes are simply servers that have access to both the
internal Hadoop cluster network and the outside world. Typical cluster
deployments will lock down access to specific ports on specific hosts
through the use of either host or network firewalls. In the context of data
ingest, we can restrict the ability to push data to a Hadoop cluster through the
edge nodes while still deploying pull-based mechanisms, such as Sqoop, to
perform parallel ingest without opening up access to sensitive Hadoop
services to the world.
Limiting remote login capabilities to only edge nodes goes a long way
toward mitigating the risk of having users be physically logged into a
Hadoop cluster. It allows you to concentrate your monitoring and security
auditing while at the same time reducing the population of potential bad
actors. When building production data flows, it’s common to set up dedicated
ETL accounts or groups that will execute the overall workflow. For
organizations that require detailed auditing, it’s recommended that actions be
initiated by individual user accounts to better track activity back to a person.
In addition to Flume and Sqoop, edge nodes may run proxy services or other
remote user protocols. For example, HDFS supports a proxy server called
HttpFS, which exposes a read/write REST interface for HDFS. Just as with
HDFS itself, HttpFS fully supports Kerberos-based authentication, and the
same authorization controls built into HDFS apply when it’s accessed
through HttpFS. Running HttpFS on an edge node can be useful for allowing
limited access to the data stored in HDFS and can even be used for certain
data ingest use cases.
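As an illustrative sketch (the edge node hostname is an assumption; 14000 is the usual HttpFS port), a Kerberos-authenticated directory listing through HttpFS looks like this, with curl's --negotiate flag performing SPNEGO authentication using the ticket obtained by kinit:
[alice@edge01 ~]$ kinit
Password for alice@EXAMPLE.COM:
[alice@edge01 ~]$ curl --negotiate -u : \
  "http://edge01.example.com:14000/webhdfs/v1/data?op=LISTSTATUS"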
Another common edge node service is Oozie, a workflow execution
and scheduling tool. Complex workflows that combine Sqoop jobs, Hive
queries, Pig scripts, and MapReduce jobs can be composed into single units
and can be reliably executed and scheduled using Oozie. Oozie also provides
a REST interface that supports Kerberos-based authentication and can be
safely exposed on an edge node.
For some use cases, it is necessary to stage files on an edge node before they
are pushed into HDFS, HBase, or Accumulo. When creating these local disk
(or sometimes NFS-mounted) staging directories, it is important to use your
standard operating system controls to limit access to only those users
authorized to access the data. Again, it’s useful to define one or more ETL
groups and limit the access to raw data to these relatively trusted groups.
This is no different than when data warehousing systems were introduced to
the enterprise. In a typical deployment, applications are tightly coupled with
the transactional systems that back them. This makes security integration
straightforward because access to the backend database is typically limited
to the application that is generating and serving the data. Some attention to
security detail gets introduced as soon as that transactional data is important
enough to back up. However, this is still a relatively easy integration to
make, because the backups can be restricted to trusted administrators that
already have access to the source systems.
Where things get interesting is when you want to move data from these
transactional systems into analytic data warehouses so that analysis can be
performed independently of the application. Using a traditional data
warehouse system, you would compare the security configuration of the
transactional database with the features of the new data warehouse. This
works fine for securing that data once it’s in the warehouse and you can
apply the same analysis to the database-based authorization features
available in Sentry. However, care must be taken in how the data is handled
between the transactional system and the analysis platform.
With these traditional systems, this comes down to securing the ETL grid that
is used to load data into the data warehouse. It’s clear that the same
considerations that you make to your ETL grid would apply to the ingest
pipeline of a Hadoop cluster. In particular, you have to consider when and
where encryption is necessary to protect the confidentiality of data. You
need to pay close attention to how to maintain the integrity of your data. This
is especially true of traditional ETL grids that might not have enough storage
capacity to maintain raw data after it has been transformed. And lastly, you
care about the availability of the ETL grid to make sure that it does not
impact the ability of the source systems or data warehouse to meet the
requirements of their users. This is exactly the same process we went through
in our discussion of data ingest into Hadoop in general, and with Flume and
Sqoop in particular.
Summary
In this chapter, we focused on the movement of data from external sources to
Hadoop. After briefly talking about batch file ingest, we moved on to focus
on the ingestion of event-based data with Flume, and the ingestion of data
sourced from relational databases using Sqoop. What we found is that these
common mechanisms for ingest have the ability to protect the integrity of data
in transit. A key takeaway from this chapter is that the protection of data
inside the cluster needs to be extended all the way to the source of ingest.
This mode of protection should match the level in place at the source
systems.
Now that we have covered protection of both data ingestion and data inside
the cluster, we can move on to the final topic of data protection, which is to
secure data extraction and client access.
1 To make setup of dm-crypt/LUKS easier, you can use the cryptsetup tool.
Instructions for setting up dm-crypt/LUKS using cryptsetup are available on
the cryptsetup FAQ page.
2 The Sqoop examples are based on the Apache Sqoop Cookbook by
Kathleen Ting and Jarek Jarcec Cecho (O’Reilly). The example files and
scripts used are available from the Apache Sqoop Cookbook project page.
3 If SSL has not yet been configured for MySQL, you can follow the
instructions in the MySQL manual.
Chapter 11. Data Extraction and Client
Access Security
The most basic form of client access comes in the form of command-line tools. As we described in “Edge
Nodes”, it’s common for clusters to limit external access to a small set of edge nodes. Users use ssh to
remotely log into an edge node and then use various command-line tools to interact with the cluster. A
brief description of the most common commands is shown in Table 11-1.
Table 11-1. Common command-line tools for client access
Command Description
hdfs dfs -put <src> <dst> Copy a local file into HDFS
hdfs dfs -get <src> <dst> Download a file from HDFS to the local filesystem
hdfs dfs -cat <path> Print the contents of a file to standard out
hdfs dfs -ls <path> List the files and directories in a path
hdfs dfs -mkdir <path> Make a directory in HDFS
hdfs dfs -cp <src> <dst> Copy an HDFS file to a new location
hdfs dfs -mv <src> <dst> Move an HDFS file to a new location
hdfs dfs -rm <path> Remove a file from HDFS
hdfs dfs -rmdir <path> Remove a directory from HDFS
hdfs dfs -chgrp <group> <path> Change the group of a file or directory
hdfs dfs -chmod <mode> <path> Change the permissions on a file or directory
hdfs dfs -chown <owner>[:<group>] <path> Change the owner of a file or directory
yarn jar <jar> [<main-class>] <args> Run a JAR file, typically used to launch a MapReduce or other YARN job
yarn application -list List running YARN applications
yarn application -kill <app-id> Kill a YARN application
mapred job -list List running MapReduce jobs
mapred job -status <job-id> Get the status of a MapReduce job
mapred job -kill <job-id> Kill a MapReduce job
hive Start a Hive SQL shell (deprecated; use beeline instead)
beeline Start a SQL shell for Hive or Impala
impala-shell Start an Impala SQL shell
hbase shell Start an HBase shell
accumulo shell Start an Accumulo shell
oozie job Run, inspect, and kill Oozie jobs
sqoop export Export a table from HDFS to a database
sqoop import Import a table from a database to HDFS
The core Hadoop command-line tools (hdfs, yarn, and mapred) only support Kerberos or delegation
tokens for authentication. The easiest way to authenticate these commands is to obtain your Kerberos
ticket-granting ticket1 using kinit before executing a command. If you don’t obtain your TGT before
executing a Hadoop command, you’ll see an error similar to Example 11-1. In particular, you’re looking
for the message failed to find any Kerberos tgt.
Example 11-1. Executing a Hadoop command with no Kerberos ticket-granting ticket
[alice@hadoop01 ~]$ hdfs dfs -cat movies.psv
cat: Failed on local exception: java.io.IOException: javax.security.sasl.SaslExc
eption: GSS initiate failed [Caused by GSSException: No valid credentials provid
ed (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local hos
t is: "hadoop01.example.com/172.25.2.196"; destination host is: "hadoop02.exampl
e.com":8020;
Now let’s see what happens after Alice first obtains her TGT using kinit (Example 11-2).
Example 11-2. Executing a Hadoop command after kinit
[alice@hadoop01 ~]$ kinit
Password for alice@EXAMPLE.COM:
[alice@hadoop01 ~]$ hdfs dfs -cat movies.psv
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(
...
This time the command completes successfully and prints the contents of the file movies.psv. One of the
advantages to using Kerberos for authentication is that the user doesn’t need to authenticate individually
for each command. If you kinit at the beginning of your session or have your Linux system configured to
obtain your Kerberos TGT during login, then you can run any number of Hadoop commands and all of the
authentication will happen behind the scenes.
Hadoop also supports authentication with delegation tokens. You can fetch a delegation token from the
NameNode using the hdfs fetchdt command and then use it to authenticate
subsequent HDFS commands by setting the HADOOP_TOKEN_FILE_LOCATION environment variable. See
Example 11-3 for an example using delegation tokens.
Example 11-3. Executing a Hadoop command using delegation tokens
[alice@hadoop01 ~]$ kinit
Password for alice@EXAMPLE.COM:
[alice@hadoop01 ~]$ hdfs fetchdt --renewer alice nn.dt
14/10/21 19:19:32 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 2 for
alice on 172.25.3.210:8020
Fetched token for 172.25.3.210:8020 into file:/home/alice/nn.dt
[alice@hadoop01 ~]$ kdestroy
[alice@hadoop01 ~]$ export HADOOP_TOKEN_FILE_LOCATION=nn.dt
[alice@hadoop01 ~]$ hdfs dfs -cat movies.psv
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(
...
Delegation tokens are completely separate from Kerberos-based authentication. Once issued, a delegation
token is valid for 24 hours by default and can be renewed for up to 7 days. You can change how long the
token is initially valid for by setting dfs.namenode.delegation.token.renew-interval, which is
expressed as the amount of time in milliseconds the token is valid before needing to be renewed. You can
set dfs.namenode.delegation.token.max-lifetime to change the max renewal lifetime of a token.
This setting is also in milliseconds. These tokens are separate from the Kerberos system, so once issued,
a delegation token remains valid until it expires or is explicitly canceled, even if the user's Kerberos
credentials are later revoked.
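For illustration, the following hdfs-site.xml snippet simply restates the defaults (24 hours and 7 days, in milliseconds):
<property>
  <name>dfs.namenode.delegation.token.renew-interval</name>
  <!-- 24 hours -->
  <value>86400000</value>
</property>
<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <!-- 7 days -->
  <value>604800000</value>
</property>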
The Hadoop command-line tools don’t have their own authorization model. Rather, they rely on the
cluster configuration to control what users can access. For a refresher on Hadoop authorization, refer
back to “HDFS Authorization”, “Service-Level Authorization”, and “MapReduce and YARN
Authorization”.
Closely related to where authorization decisions are made is the question of how deep user accounts go. Historically,
databases have maintained their own private identity directories. This often made it complicated to
replicate each user from a corporate directory into the database. Accumulo still uses its own identity
directory so it shares this drawback. In response to this, many application developers adopted a pattern
where the database would store application-level accounts and it was the application’s responsibility to
downgrade its access to take advantage of database-level authorizations.
The ability to downgrade access is why Accumulo requires users to pass in a list of authorizations when
accessing Accumulo via the Java API. This allows an application to perform authentication with the end
user and to look up the user’s authorizations using a central service. The application will then pass these
end user authorizations when reading data. Accumulo will automatically intersect the application’s
authorizations with the end user’s authorizations. This means that you can control the maximum level of
data that an application can access while still providing end users with finer-grained access based on
their individual level.
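A minimal Java sketch of this pattern, assuming an application account named app, a table named records, and illustrative authorization labels:
import java.util.Map;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

// Connect as the application's service account.
Instance instance = new ZooKeeperInstance("accumulo", "zoo-1.example.com");
Connector conn = instance.getConnector("app", new PasswordToken("secret"));

// Authorizations looked up for the end user from a central service; Accumulo
// intersects these with the authorizations granted to the app account itself.
Authorizations userAuths = new Authorizations("public", "hr");

// Scan with the downgraded authorizations so results are filtered to what
// both the application and the end user are allowed to see.
Scanner scanner = conn.createScanner("records", userAuths);
for (Map.Entry<Key, Value> entry : scanner) {
  System.out.println(entry.getKey() + " -> " + entry.getValue());
}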
In Chapter 5, we saw how to configure HBase to use Kerberos for authentication. HBase clients can
access HBase via the shell, the Java API, or through one of the HBase gateway servers. All of the client
access APIs support Kerberos for authentication and require that the user first obtain a Kerberos TGT
before connecting.
The method of client access depends on your use case. For database administrative access such as
creating, modifying, or deleting tables, the HBase shell is commonly used. When using MapReduce or
another data processing framework, access is through the Java API. Other types of HBase applications
may use the Java API directly or access HBase through a gateway. It’s especially common to use one of
the gateway APIs when accessing HBase from a language other than Java. The gateways also provide a
choke point for administrators to restrict direct access to HBase. Next, we’ll see how to securely interact
with HBase via the shell before discussing how to configure the HBase gateways with security.
When using the shell, the user typically obtains their TGT by executing kinit. If you try to run the shell
without running kinit, you’ll see something similar to Example 11-4. What you’re looking for is again
the Failed to find any Kerberos tgt at the end of the stack trace.
Example 11-4. Using the HBase shell with no Kerberos ticket-granting ticket
[alice@hadoop01 ~]$ hbase shell
14/11/13 14:45:53 INFO Configuration.deprecation: hadoop.native.lib is depre
cated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.6, rUnknown, Sat Oct 11 15:15:15 PDT 2014
hbase(main):001:0> list
TABLE
14/11/13 14:46:00 WARN ipc.RpcClient: Exception encountered while connecting
to the server : javax.security.sasl.SaslException: GSS initiate failed [Cau
sed by GSSException: No valid credentials provided (Mechanism level: Failed
to find
any Kerberos tgt)]
14/11/13 14:46:00 FATAL ipc.RpcClient: SASL authentication failed. The most
likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSExcepti
on: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
...
ERROR: No valid credentials provided (Mechanism level: Failed to find any Ke
rberos tgt)
Here is some help for this command:
List all tables in hbase. Optional regular expression parameter could
be used to filter the output. Examples:
hbase> list
hbase> list 'abc.*'
hbase> list 'ns:abc.*'
hbase> list 'ns:.*'
hbase(main):002:0>
Now let’s try that again in Example 11-5, but this time we’ll obtain our TGT using kinit before
executing the shell.
Example 11-5. Using the HBase shell after kinit
[alice@hadoop01 ~]$ kinit
Password for alice@EXAMPLE.COM:
[alice@hadoop01 ~]$ hbase shell
14/11/13 14:53:56 INFO Configuration.deprecation: hadoop.native.lib
is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.6, rUnknown, Sat Oct 11 15:15:15 PDT 2014
hbase(main):001:0> list
TABLE
analytics_demo
document_demo
2 row(s) in 3.1900 seconds
=> ["analytics_demo", "document_demo"]
hbase(main):002:0> whoami
alice@EXAMPLE.COM (auth:KERBEROS)
groups: alice, hadoop-users
hbase(main):003:0>
The HBase shell doesn’t have unique authorization configuration and all access will be authorized per the
configuration of HBase authorization. See “HBase and Accumulo Authorization” for a refresher on HBase
authorization.
The first step is to create a Kerberos principal for the REST gateway to talk to the rest of HBase. This is
a service principal and should include the hostname of the server running the REST gateway—for
example, rest/rest01.example.com@EXAMPLE.COM, where rest01.example.com is replaced with the fully
qualified domain name of the server the REST gateway is run on. After creating the principal and
exporting a keytab file with the principal's key, you need to configure the REST server to use Kerberos to
talk to a secure HBase cluster. Let's set the following in the hbase-site.xml file:
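A representative snippet (the keytab path is illustrative):
<property>
  <name>hbase.rest.keytab.file</name>
  <value>/etc/hbase/conf/hbase-rest.keytab</value>
</property>
<property>
  <name>hbase.rest.kerberos.principal</name>
  <value>rest/_HOST@EXAMPLE.COM</value>
</property>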
If HBase authorization is turned on, you also need to create a top-level ACL for the principal the REST
server is using. Assuming you want to grant everything (including administrative access) through the
REST gateway, then you would use the HBase shell to execute the following (see “HBase and Accumulo
Authorization” for a refresher on HBase authorization):
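Assuming the rest short name for the principal, this grants all permissions (read, write, create, and admin):
hbase(main):001:0> grant 'rest', 'RWCA'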
If you used a different principal name, then replace rest with the short name for your principal. The next
step is to enable authentication with REST clients through SPNEGO/Kerberos. Per the SPNEGO
specification, you need to create a principal with the format HTTP/<fully qualified domain name>,
where <fully qualified domain name> is replaced with the fully qualified domain name of the server the REST
gateway is run on. Let’s set the following in hbase-site.xml to turn on authentication:
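A representative snippet, reusing the keytab from the previous step:
<property>
  <name>hbase.rest.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.rest.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hbase.rest.authentication.kerberos.keytab</name>
  <value>/etc/hbase/conf/hbase-rest.keytab</value>
</property>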
In this example, we configured the REST authentication keytab to the same location as the HBase
authentication keytab. This means that you either need to export both keys at the same time or use
ktutil to combine the keys for both principals into a single keytab file. Alternatively, you can use different
keytab files for the REST client authentication and the HBase authentication.
The REST server always authenticates with HBase using the hbase.rest.kerberos.principal, but it
will perform actions on behalf of the user that authenticated with the REST server. In order to do this, the
REST server must have privileges to impersonate other users. We can use the same
hadoop.proxyuser.<proxy user>.groups and hadoop.proxyuser.<proxy user>.hosts settings we described in "Impersonation".
As a refresher, these settings control which users the proxy user can impersonate and which hosts they can
impersonate from. The values of those settings are comma-separated lists of the groups and hosts,
respectively, or * to mean all groups/hosts. For example, if you want the rest user to impersonate any
users in the hbase-users group from any host, you’d add the following on the hbase-site.xml file on the
HBase Master:
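For example, using the standard proxy user property names:
<property>
  <name>hadoop.proxyuser.rest.groups</name>
  <value>hbase-users</value>
</property>
<property>
  <name>hadoop.proxyuser.rest.hosts</name>
  <value>*</value>
</property>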
The REST server also supports remote REST clients impersonating end users. This is called two-level
user impersonation because the REST client impersonates a user who is then impersonated by the REST
server. This lets you run an application that accesses HBase through the REST server where the
application can pass user credentials all the way to HBase. This level of impersonation is enabled by
setting hbase.rest.support.proxyuser to true. You can control which end users an application
accessing the REST server can impersonate by setting the hadoop.proxyuser.<application user>.groups
configuration setting. Let’s say we have an application called whizbang that can impersonate any of the
users in the whizbang-users group. We would set the following in the hbase-site.xml file on the REST
server:
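For example, again using the standard proxy user property names:
<property>
  <name>hadoop.proxyuser.whizbang.groups</name>
  <value>whizbang-users</value>
</property>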
Figure 11-1 shows how two-level impersonation works through the HBase REST server. The end user,
Alice, authenticates with an LDAP username and password to prove her identity to Hue (1). Hue then
authenticates with Kerberos using the hue principal and passing a doAs
user of alice (2). Finally, the HBase REST server authenticates with Kerberos using the
rest principal and passing a doAs user of alice (3). This
effectively propagates Alice’s credentials all the way from the user to HBase.
Figure 11-1. Two-level user impersonation
The REST server supports encrypting the connection between clients and the REST server by enabling
TLS/SSL. We can enable SSL by setting hbase.rest.ssl.enabled to true and configuring the REST
server to use a Java keystore file with the private key and certificate. If our keystore is in
/etc/hbase/conf/rest.example.com.jks and the key and keystore use the password secret, then we'd set
the following in the hbase-site.xml file on the REST server:
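A representative snippet, assuming the standard hbase.rest.ssl.* property names:
<property>
  <name>hbase.rest.ssl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rest.ssl.keystore.store</name>
  <value>/etc/hbase/conf/rest.example.com.jks</value>
</property>
<property>
  <name>hbase.rest.ssl.keystore.password</name>
  <value>secret</value>
</property>
<property>
  <name>hbase.rest.ssl.keystore.keypassword</name>
  <value>secret</value>
</property>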
If the REST server certificate isn’t signed by a trusted certificate, then you need to import the certificate
into the Java central truststore using the keytool command-line tool:
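A sketch (the alias and certificate filename are illustrative):
[root@rest01 ~]# keytool -import -alias rest.example.com -file rest.example.com.crt \
-keystore ${JAVA_HOME}/jre/lib/security/cacerts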
WARNING
The preceding keytool command imports a certificate into Java’s central trusted certificates store. That means that any
certificate you import will be trusted by any Java application—not just HBase—that is using the given JRE.
HBase Thrift Gateway
Like the REST gateway, the HBase Thrift gateway supports user authentication with Kerberos. The first
step is to create a Kerberos principal for the Thrift gateway to talk to the rest of HBase. This is a service
principal and should include the hostname of the server running the Thrift gateway (e.g.,
thrift/thrift01.example.com@EXAMPLE.COM). After creating the principal and exporting a keytab file
with the principal’s key, you need to configure the Thrift server to use Kerberos to talk to a secure HBase
cluster. Let’s set the following in the hbase-site.xml file:

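A representative snippet (the keytab path is illustrative):
<property>
  <name>hbase.thrift.keytab.file</name>
  <value>/etc/hbase/conf/hbase-thrift.keytab</value>
</property>
<property>
  <name>hbase.thrift.kerberos.principal</name>
  <value>thrift/_HOST@EXAMPLE.COM</value>
</property>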
If HBase authorization is turned on, you also need to create a top-level ACL for the principal the Thrift
server is using. Assuming you want to grant everything, including administrative access, through the Thrift
gateway, then you would use the HBase shell to execute the following:
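Assuming the Thrift principal's short name is thrift:
hbase(main):001:0> grant 'thrift', 'RWCA'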
At this point, the Thrift gateway will be able to access a secure HBase cluster but won’t do user
authentication. You need to set hbase.thrift.security.qop to one of the following three values to
enable authentication:
auth
Enable authentication
auth-int
Enable authentication and integrity checking
auth-conf
Enable authentication, confidentiality (encryption), and integrity checking
As with the REST gateway, we need to enable the Thrift user to impersonate the users that authenticate
with the Thrift gateway. Again, we'll use the hadoop.proxyuser.<proxy user>.groups and
hadoop.proxyuser.<proxy user>.hosts settings in the HBase Master's hbase-site.xml file:
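For example, to let the thrift user impersonate members of the hbase-users group from any host:
<property>
  <name>hadoop.proxyuser.thrift.groups</name>
  <value>hbase-users</value>
</property>
<property>
  <name>hadoop.proxyuser.thrift.hosts</name>
  <value>*</value>
</property>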
Unlike the REST gateway, the Thrift gateway does not support application users impersonating end users.
This means that if an application is accessing HBase through the Thrift gateway, then all access will
proceed as the app user. In the upcoming HBase 1.0, the ability for the Thrift gateway to do impersonation
is being added when the Thrift gateway is configured to use HTTPS as the transport. The work for this is
being tracked in HBASE-12640.
Figure 11-2 shows how impersonation works with the Thrift gateway. Suppose you have a web
application that uses the Thrift gateway to access HBase. The user, Alice, authenticates with the web
application using PKI (1). The application then authenticates with the Thrift gateway using Kerberos (2).
Because one level of impersonation is supported, the Thrift gateway authenticates with HBase using the
thrift principal and a doAs user of the web application (3). HBase won't know that
the original end user was Alice and it will be up to the web application to apply additional authorization
controls before showing results to Alice. Take a look back at Figure 11-1 and compare and contrast one-
level and two-level user impersonation.
Figure 11-2. Thrift gateway application-level impersonation
Accumulo
Accumulo client access can be achieved with two mechanisms: the shell and the proxy server. The
Accumulo shell is similar to the HBase shell, whereas the proxy server is similar to the HBase thrift
server gateway.
Unlike HBase, Accumulo uses usernames and passwords for authentication, so clients must provide
both when connecting to Accumulo. (Support for Kerberos authentication of clients is coming in
Accumulo 1.7.0 and is tracked in ACCUMULO-1815.) When using the Accumulo shell, you can
pass a username with the -u or --user command-line parameters or you can let it default to the Linux
username where you’re running the shell. If you don’t pass any parameters, Accumulo will prompt you for
the password on stdin. Alternatively, you can have the password provided on the command line, in a file,
or in an environment variable. These methods are enabled by passing the -p or --password parameters
with an option of pass:<literal password>, file:<path to file with password>, or env:
<environment variable with password>, respectively. You can also change the user after launching
the Accumulo shell using the user command. The user will be prompted for their password. See
Example 11-6 to see the various methods of passing a password to the Accumulo shell. Notice that when
the wrong password is provided, the shell will print the message Username or Password is Invalid.
Example 11-6. Authenticating with the Accumulo shell
[alice@hadoop01 ~]$ accumulo shell
Password: ***
2014-11-13 15:19:54,225 [shell.Shell] ERROR: org.apache.accumulo.core.client
.AccumuloSecurityException: Error BAD_CREDENTIALS for user alice - Username
or Password is Invalid
[alice@hadoop01 ~]$ accumulo shell
Password: ******
Shell - Apache Accumulo Interactive Shell
-
version: 1.6.0
instance name: accumulo
- instance id: 382edcfb-5078-48b4-8570-f61d92915015
-
type 'help' for a list of available commands
-
alice@accumulo> quit
[alice@hadoop01 ~]$ accumulo shell -p pass:secret
Shell - Apache Accumulo Interactive Shell
-
version: 1.6.0
instance name: accumulo
- instance id: 382edcfb-5078-48b4-8570-f61d92915015
-
type 'help' for a list of available commands
-
alice@accumulo> quit
[alice@hadoop01 ~]$ accumulo shell -p file:accumulo_pass.txt
Shell - Apache Accumulo Interactive Shell
-
version: 1.6.0
instance name: accumulo
- instance id: 382edcfb-5078-48b4-8570-f61d92915015
-
type 'help' for a list of available commands
-
alice@accumulo> quit
[alice@hadoop01 ~]$ accumulo shell -p env:ACCUMULO_PASS
Shell - Apache Accumulo Interactive Shell
-
version: 1.6.0
instance name: accumulo
- instance id: 382edcfb-5078-48b4-8570-f61d92915015
-
type 'help' for a list of available commands
-
alice@accumulo> user bob
Enter password for user bob: ***
bob@accumulo>
The Accumulo shell doesn’t have unique authorization configuration and all access will be authorized per
the configuration of Accumulo authorization. See “HBase and Accumulo Authorization” for a refresher on
Accumulo authorization.
The Accumulo proxy server is configured with the
$ACCUMULO_HOME/proxy/proxy.properties file. The protocolFactory setting determines the underlying
Thrift protocol that will be used by the server and the clients. Changing this setting must be coordinated
with the protocol implementation that clients are using. If you need to support multiple Thrift protocols,
you should deploy multiple proxy servers.
The other setting that must be synced between clients and the proxy server is the tokenClass. The proxy
server doesn’t authenticate clients directly and instead passes the authentication token provided by the
user to Accumulo for authentication. If you need to support multiple types of authentication tokens
simultaneously, you need to deploy multiple proxy servers.
An example proxy.properties file is shown here:
protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory
tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken
port=42424
instance=accumulo-instance
zookeepers=zoo-1.example.com,zoo-2.example.com,zoo-3.example.com
Because Accumulo passes the authentication token from the application accessing the proxy server to
Accumulo, you get the equivalent of one level of impersonation, as shown in Figure 11-2. Accumulo
supports downgrading the access of the application user, so it’s possible for the web application to look
up Alice’s authorizations and have Accumulo’s authorization filter limit access to data that Alice is
authorized for.
Oozie is a very important tool from a client access perspective. In addition to being the workflow
executor and scheduler for your cluster, Oozie can be used as a gateway service for clients to submit any
type of job. This allows you to shield direct access to your YARN or MR1 servers from clients while
still allowing remote job submission. If you choose to use this style of architecture, it’s very important to
secure your Oozie server by enabling authentication and authorization. We previously described how to
configure Kerberos authentication in “Oozie” while authorization was detailed in “Oozie Authorization”.
Once those features are enabled on the server side, they can be used by clients with little additional
configuration. To authenticate with Oozie, you simply need to have a Kerberos TGT cached on your
workstation. This can easily be handled by running kinit before issuing an Oozie command from the
command line. If you run an Oozie command and see an error that says Failed to find any Kerberos
tgt, then you probably didn’t run kinit:
[alice@edge01 ~]$ oozie jobs -oozie http://oozie01.example.com:11000/oozie
Error: AUTHENTICATION : Could not authenticate, GSSException: No valid crede
ntials provided (Mechanism level: Failed to find any Kerberos tgt)
[alice@edge01 ~]$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1236000001)
[alice@edge01 ~]$ kinit
Password for alice@EXAMPLE.COM:
[alice@edge01 ~]$ oozie jobs -oozie http://oozie01.example.com:11000/oozie
No Jobs match your criteria!
While we’ve configured Oozie with authentication and authorization, we haven’t done anything to
guarantee confidentiality of the communication between the Oozie client and the Oozie server.
Fortunately, Oozie supports using HTTPS to encrypt the connection and provide integrity checks. In order
to enable HTTPS, you must get a certificate issued to the Oozie server by your certificate authority. See
“Flume Encryption” for an example of creating a self-signed certificate.
Once your certificate authority has issued a certificate and you have the certificate and private key in a
PKCS12 file, you can import the certificate and private key into a Java keystore file. In the following
example, we use the same pass phrase, secret, for both the keystore and the certificate’s private key:
[root@oozie01 ~]# mkdir /etc/oozie/ssl
[root@oozie01 ~]# keytool -v -importkeystore \
-srckeystore /etc/pki/tls/private/oozie01.example.com.p12 \
-srcstoretype PKCS12 \
-destkeystore /etc/oozie/ssl/oozie01.example.com.keystore -deststoretype JKS \
-deststorepass secret -srcalias oozie01.example.com -destkeypass secret
Enter source keystore password:
[Storing /etc/oozie/ssl/oozie01.example.com.keystore]
[root@oozie01 ~]# chown -R oozie:oozie /etc/oozie/ssl
[root@oozie01 ~]# chmod 400 /etc/oozie/ssl/*
[root@oozie01 ~]# chmod 700 /etc/oozie/ssl
Next, set the environment variables that control the keystore location and password in the oozie-env.sh
file:
export OOZIE_HTTPS_KEYSTORE_FILE=/etc/oozie/ssl/oozie01.example.com.keystore
export OOZIE_HTTPS_KEYSTORE_PASS=secret
WARNING
The keystore password used here will be visible to anyone that can perform a process listing on the server running Oozie. You
must protect the keystore file itself with strong permissions to prevent users from reading or modifying the keystore.
Before you configure Oozie to use HTTPS, you need to make sure the Oozie server isn’t running. To
configure Oozie to use HTTPS, run the following command:
[oozie@oozie01 ~]$ oozie-setup.sh prepare-war -secure
Now if you start the server it will use HTTPS over port 11443. The port can be changed by setting the
OOZIE_HTTPS_PORT environment variable in the oozie-env.sh file.
On client machines that will be accessing Oozie, you can simply change the Oozie URL to
https://oozie01.example.com:11443/oozie on the command line. For example:
[alice@edge01 ~]$ oozie jobs -oozie https://oozie01.example.com:11443/oozie
No Jobs match your criteria!
If you get a SSLHandshakeException error instead of the expected output as shown here:
[alice@edge01 ~]$ oozie jobs -oozie https://oozie01.example.com:11443/oozie
Error: IO_ERROR : javax.net.ssl.SSLHandshakeException: sun.security.validato
r.ValidatorException: PKIX path building failed: sun.security.provider.certp
ath.SunCertPathBuilderException: unable to find valid certification path to
requested target
Then it means your Oozie server is using a certificate that isn’t signed by a trusted certificate authority.
This can happen if you’re using a self-signed certificate or an internal CA that isn’t signed by one of the
root CAs. Let's say we have a certificate authority for the EXAMPLE.COM realm in a file called example-ca.crt. Because we'll be importing this into Java's central truststore, this certificate will be trusted by all
Java applications running on this server, not just Oozie. We can import the certificate into Java’s central
truststore using the following command:
[root@edge01 ~]# keytool -import -alias EXAMPLE.COM -file example-ca.crt \
-keystore ${JAVA_HOME}/jre/lib/security/cacerts
Enter keystore password:
Owner: CN=Certificate Authority, O=EXAMPLE.COM
Issuer: CN=Certificate Authority, O=EXAMPLE.COM
Serial number: 1
...
Trust this certificate? [no]: yes
Certificate was added to keystore
The default password for the Java cacerts file is changeit. If Oozie is configured with HA, then you
need to configure your load balancer to do TLS pass-through. This will allow clients to see the certificate
presented by the Oozie servers and won’t require the load balancer to have its own certificate. When
you’re doing TLS pass-through, you should either use a wildcard certificate or certificates with subject
alternate names that include the load balancer’s fully qualified domain name as a valid name.
In Chapter 10, we discussed how to protect the confidentiality, integrity, and availability of your data
ingest pipeline. The same principles hold for securing data extraction pipelines. In the case of Sqoop,
confidentiality isn’t provided by Sqoop itself but it may be provided by the drivers that Sqoop uses to talk
to an RDBMS server. In “Sqoop Encryption”, we showed how you can configure the MySQL driver to
use SSL to encrypt traffic between the MySQL server and the tasks executed by Sqoop. You can use the
same parameters to encrypt the data during an export as shown in Example 11-7.
Example 11-7. Exporting a MySQL table over SSL
[alice@sqoop01 ~]$ hdfs dfs -cat cities/*
1,USA,Palo Alto
2,Czech Republic,Brno
3,USA,Sunnyvale
[alice@sqoop01 ~]$ URI="jdbc:mysql://mysql01.example.com/sqoop"
[alice@sqoop01 ~]$ URI="${URI}?verifyServerCertificate=false"
[alice@sqoop01 ~]$ URI="${URI}&useSSL=true"
[alice@sqoop01 ~]$ URI="${URI}&requireSSL=true"
[alice@sqoop01 ~]$ sqoop export --connect ${URI} \
--username sqoop -P --table cities \
--export-dir cities
Enter password:
...
14/06/28 17:27:22 INFO mapreduce.ExportJobBase: Exported 3 records.
[alice@sqoop01 ~]$
As described in Chapter 1, there are two popular ways for accessing Hadoop data using SQL: Hive and
Impala. Both Hive and Impala support both Kerberos and LDAP-based username/password
authentication. Users don’t typically interact with Hive or Impala directly and instead rely on SQL shells
or JDBC drivers.
Using Impala with Kerberos authentication
When Kerberos is enabled for communication with Hadoop, then Kerberos-based client authentication is
automatically enabled. Impala uses command-line parameters for configuration. You should set the
--principal and --keytab_file parameters on the impalad, statestored, and catalogd daemons. The
--principal should be set to the Kerberos principal that Impala uses for authentication. This will
typically be of the format impala/<fully qualified domain name>@<realm> where <fully
qualified domain name> is the host name of the server running impalad and <realm> is the Kerberos
realm. The first component of the principal, impala, must match the name of the user starting the Impala
process. The --keytab_file parameter must point to a keytab file that contains the previously mentioned
principal and the HTTP principal for the server running impalad. You can create a keytab file with both
principals from two independent keytabs using the ktutil command, as shown in Example 11-8.
Example 11-8. Merging the Impala and HTTP keytabs
[impala@impala01 ~]$ ktutil
ktutil: rkt impala.keytab
ktutil: rkt http.keytab
ktutil: wkt impala-http.keytab
ktutil: quit
To make it easier to configure, you can set the command-line parameters that the impalad process uses by
setting the IMPALA_SERVER_ARGS, IMPALA_STATE_STORE_ARGS, and IMPALA_CATALOG_ARGS variables
in the /etc/default/impala file. Example 11-9 shows how to enable Kerberos for Impala.
Example 11-9. Configuring Impala with Kerberos authentication
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
--principal=impala/impala01.example.com@EXAMPLE.COM \
--keytab_file=/etc/impala/conf/impala-http.keytab"
IMPALA_STATE_STORE_ARGS="${IMPALA_STATE_STORE_ARGS} \
--principal=impala/impala01.example.com@EXAMPLE.COM \
--keytab_file=/etc/impala/conf/impala-http.keytab"
IMPALA_CATALOG_ARGS="${IMPALA_CATALOG_ARGS} \
--principal=impala/impala01.example.com@EXAMPLE.COM \
--keytab_file=/etc/impala/conf/impala-http.keytab"
If users access Impala behind a load balancer, then the configuration changes slightly. When building the
combined keytab, you also need to include the keytab for the proxy server principal and you need to add
the --be_principal parameter. The --be_principal is the principal that Impala uses for talking to
backend services like HDFS. This should be set to the same value that --principal was set to before,
and --principal should be changed to the principal for the load balancer. If your load balancer is on the
impala-proxy.example.com server then you would set the IMPALA_SERVER_ARGS as shown in
Example 11-10.
Example 11-10. Configuring Impala behind a load balancer with Kerberos authentication
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
--principal=impala/impala-proxy.example.com@EXAMPLE.COM \
--be_principal=impala/impala01.example.com@EXAMPLE.COM \
--keytab_file=/etc/impala/conf/impala-http.keytab"
Impala supports using YARN for resource management via a project called Llama. Llama mediates
resource management between YARN and low-latency execution engines such as Impala. Llama has two
components, a long-running application master and a node manager plug-in. The application master
handles reserving resources for Impala while the node manager plug-in coordinates with the local Impala
daemon regarding changes to available resources on the local node.
When enabling Kerberos for Impala, you must also configure Kerberos for Llama by configuring the
following properties in the llama-site.xml file.
llama.am.server.thrift.security
Set to true to enable Thrift SASL/Kerberos-based security for the application master.
llama.am.server.thrift.security.QOP
Set the quality of protection when security is enabled. Valid values are auth for authentication only,
auth-int for authentication and integrity, and auth-conf for authentication, integrity, and
confidentiality (encryption).
llama.am.server.thrift.kerberos.keytab.file
Set the location of the application master keytab file. If this is a relative path, then it is looked up
under the Llama configuration directory.
llama.am.server.thrift.kerberos.server.principal.name
The fully qualified principal name for the Llama application server. This setting must include both the
short name and the fully qualified hostname of the server running the Llama application master.
A final property sets the short name used for client notifications. This short name is combined with the client hostname
provided by the impalad process during registration. You can override the hostname that the impalad
process registers with by configuring the --hostname parameter in the
IMPALA_SERVER_ARGS variable.
Example 11-11 shows a snippet of the llama-site.xml file configured to enable Kerberos security.
Example 11-11. Configuring Llama application master with Kerberos authentication
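A sketch limited to the properties described above (the keytab path and hostname are illustrative):
<property>
  <name>llama.am.server.thrift.security</name>
  <value>true</value>
</property>
<property>
  <name>llama.am.server.thrift.security.QOP</name>
  <value>auth-conf</value>
</property>
<property>
  <name>llama.am.server.thrift.kerberos.keytab.file</name>
  <value>llama.keytab</value>
</property>
<property>
  <name>llama.am.server.thrift.kerberos.server.principal.name</name>
  <value>llama/llama01.example.com</value>
</property>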
Once Impala is configured to use Kerberos authentication, then clients can authenticate by having a
cached Kerberos TGT (i.e., running kinit before executing the shell). Example 11-12 shows Alice
obtaining her Kerberos TGT and then authenticating with Kerberos using the Impala shell.
Example 11-12. Impala shell with Kerberos authentication
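A representative session; the -k flag tells the shell to use Kerberos, and the hostnames are illustrative:
[alice@hadoop01 ~]$ kinit
Password for alice@EXAMPLE.COM:
[alice@hadoop01 ~]$ impala-shell -i impala-proxy -k
Starting Impala Shell using Kerberos authentication
Connected to impala-proxy:21000
[impala-proxy:21000] >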
When using Kerberos authentication with JDBC drivers, you need to first obtain a Kerberos TGT and
then include the name of the Impala principal in the connection string. Example 11-13 shows how to set
up the JDBC connection string for Kerberos authentication with Impala.
Example 11-13. JDBC connection string for Kerberos authentication
// Start with the basic JDBC URL string
String url = "jdbc:hive2://impala-proxy.example.com:21050/default";
// Add the Impala Kerberos principal name, including the FQDN and realm
url = url + ";principal=impala/impala-proxy.example.com@EXAMPLE.COM";
// Create the connection from the URL
Connection con = DriverManager.getConnection(url);
Using Impala with LDAP/Active Directory authentication
You can enable LDAP authentication by setting the --enable_ldap_auth and --ldap_uri parameters
on the impalad daemons. When configuring LDAP authentication you can optionally set bind parameters
depending on the type of LDAP provider you’re using. If you’re using Active Directory, you often don’t
need additional configuration, but you can explicitly set the domain name so that the username used to
bind to AD will be passed as user@<domain name>. The domain name is set by specifying the --
ldap_domain parameter. For OpenLDAP or freeIPA, you can configure a base distinguished name and the
username used to bind to LDAP will be passed as uid=user,<base dn>. This setting is enabled by
specifying the --ldap_baseDN parameter. If your LDAP provider doesn’t use uid=user to specify the
username in distinguished names, then you can provide a pattern that will become the distinguished name.
The pattern works by replacing all instances of #UID with the username prior to binding. This setting is
enabled by specifying the --ldap_bind_pattern parameter.
Regardless of the LDAP provider, it's strongly recommended that you use TLS to encrypt the connection
between Impala and the LDAP server. This can be done by either using an ldaps:// URL or by enabling
StartTLS. You can enable StartTLS by setting the --ldap_tls parameter to true. For either mode, you
have to configure the certificate authority (CA) certificate so that Impala trusts the certificate used by the
LDAP server. You can set the --ldap_ca_certificate parameter to configure the location of the CA
certificate.
Refer to Examples 11-14 through 11-16 for sample configurations when using Active Directory,
OpenLDAP, and custom LDAP providers, respectively.
Example 11-14. Configuring Impala with Active Directory
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
--enable_ldap_auth=true \
--ldap_uri=ldaps://ad.example.com \
--ldap_ca_certificate=/etc/impala/pki/ca.crt \
--ldap_domain=example.com"
Example 11-15. Configuring Impala with OpenLDAP
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
--enable_ldap_auth=true \
--ldap_uri=ldaps://ldap.example.com \
--ldap_ca_certificate=/etc/impala/pki/ca.crt \
--ldap_baseDN=ou=People,dc=example,dc=com"
Example 11-16. Configuring Impala with other LDAP provider
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
--enable_ldap_auth=true \
--ldap_uri=ldaps://ldap.example.com \
--ldap_ca_certificate=/etc/impala/pki/ca.crt \
--ldap_bind_pattern=user=#UID,ou=users,dc=example,dc=com"
Specify the -l command-line option to the impala-shell to connect to Impala using LDAP
authentication. You’ll be prompted for the user password before the connection is complete. If you want
to connect as a user other than the current Linux user, you can specify the -u option to change the username.
Example 11-17 shows how to authenticate using LDAP with the Impala shell.
Example 11-17. Impala shell with LDAP/Active Directory authentication
[alice@hadoop01 ~]$ impala-shell -i impala-proxy -l
Starting Impala Shell using LDAP-based authentication
LDAP password for alice:
Connected to impala-proxy:21000
Server version: impalad version 2.0.0 RELEASE (build ecf30af0b4d6e56ea80297d
f2189367ada6b7da7)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v2.0.0 (ecf30af) built on Sat Oct 11 13:5
6:06 PDT 2014)
[impala-proxy:21000] > show tables;
Query: show tables
+-----------+
| name |
+-----------+
| sample_07 |
| sample_08 |
+-----------+
Fetched 2 row(s) in 0.16s
[impala-proxy:21000] >
If you’re connecting to Impala using JDBC drivers, then you pass the username and password to the
DriverManager when getting a connection. Example 11-18 shows how to connect using the JDBC driver
with LDAP authentication.
Example 11-18. JDBC connection string for LDAP/Active Directory authentication
// Use the basic JDBC URL string
String url = "jdbc:hive2://impala-proxy.example.com:21050/default";
// Create the connection from the URL passing in the username and password
Connection con = DriverManager.getConnection(url, "alice", "secret");
Using SSL wire encryption with Impala
The methods described so far have covered different ways for clients to authenticate with Impala. It is
also important to set up a protected channel for data transfers between clients and Impala. This is even
more critical when the data processed by Impala is sensitive, such as data that requires at-rest encryption.
Impala supports SSL wire encryption for this purpose. Example 11-19 shows the necessary startup flags.
Example 11-19. Configuring Impala with SSL
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
--ssl_client_ca_certificate=/etc/impala/ca.cer \
--ssl_private_key=/etc/impala/impala.key \
--ssl_server_certificate=/etc/impala/impala.cer"
The ssl_private_key, ssl_server_certificate, and ssl_client_ca_certificate paths must all
be readable by the impala user, and the certificates must be in PEM format. It is recommended to restrict
the permissions of the private key to 400.
When Impala is set up with SSL, clients must also know how to connect properly. The --ssl option tells
the impala-shell to enable SSL for the connection, and the --ca_cert argument specifies the certificate
authority chain (in PEM format) to use to verify the certificate presented by the Impala daemon you are
connecting to. Example 11-20 shows what this looks like when using both Kerberos authentication and
SSL wire encryption.
Example 11-20. Impala shell with SSL and Kerberos
[alice@hadoop01 ~]$ impala-shell -i impala-proxy -k --ssl --ca_cert /etc/impala/ca.pem
Starting Impala Shell using Kerberos authentication
SSL is enabled
Connected to impala-proxy:21000
Server version: impalad version 2.0.0 RELEASE (build ecf30af0b4d6e56ea80297d
f2189367ada6b7da7)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v2.0.0 (ecf30af) built on Sat Oct 11 13:5
6:06 PDT 2014)
[impala-proxy:21000] > show tables;
Query: show tables
+-----------+
| name |
+-----------+
| sample_07 |
| sample_08 |
+-----------+
Fetched 2 row(s) in 0.16s
[impala-proxy:21000] >
The old, deprecated Hive command-line tool, hive, does not support direct authentication or
authorization with Hive. Instead, it either directly accesses data on HDFS or launches a MapReduce job
to execute a query. This means it follows the same rules as the Hadoop commands described before and
only supports Kerberos and delegation tokens. In general, the hive command is deprecated and users
should use beeline instead.
When using beeline or JDBC drivers, users connect to the HiveServer2 daemon which handles query
parsing and execution. HiveServer2 supports Kerberos, LDAP, and custom authentication plug-ins. Only
one authentication provider can be configured at a time, so administrators need to choose the preferred
authentication mechanism to use when configuring the HiveServer2 daemon. A workaround for this
limitation is to run multiple HiveServer2 daemons that share the same Hive metastore. This requires that
end users connect to the correct HiveServer2 depending on their authentication needs. The authentication
mechanism for HiveServer2 is configured in the hive-site.xml file. See Table 11-2 for a description of the
HiveServer2 authentication configuration properties.
Table 11-2. Configuration properties for HiveServer2 authentication
Property Description
hive.server2.authentication Client authentication type; valid values are NONE, KERBEROS, LDAP, and CUSTOM
hive.server2.authentication.kerberos.principal The Kerberos principal for the HiveServer2 daemon
hive.server2.authentication.kerberos.keytab The keytab used to authenticate with the KDC
hive.server2.thrift.sasl.qop The SASL quality of protection to use with Kerberos connections; valid
values are auth for authentication only, auth-int for authentication and
integrity checks, and auth-conf for authentication, integrity, and
confidentiality (encryption)
hive.server2.use.SSL Set to true to enable TLS between clients and the HiveServer2 daemon
hive.server2.keystore.path The path to a Java keystore file with the private key to use with TLS
hive.server2.keystore.password The password for the Java keystore file
hive.server2.authentication.ldap.url The URL to the LDAP/Active Directory server; only used if
hive.server2.authentication is set to LDAP
hive.server2.authentication.ldap.Domain The Active Directory domain to authenticate against; only used if
hive.server2.authentication.ldap.url points to an AD server
hive.server2.authentication.ldap.baseDN The base distinguished name to use when
hive.server2.authentication.ldap.url points to an OpenLDAP server
hive.server2.custom.authentication.class The name of a class that implements the
org.apache.hive.service.auth.PasswdAuthenticationProvider interface; used when
hive.server2.authentication is set to CUSTOM
Using HiveServer2 with Kerberos authentication
Configuring HiveServer2 with Kerberos authentication follows the same pattern as with the core Hadoop
services described in “Configuration”. Namely, we need to set the authentication type to Kerberos and set
the Kerberos principal and keytab. When setting the Kerberos principal, we can use the _HOST wildcard
placeholder. This will automatically be replaced with the fully qualified domain name of the server
running the HiveServer2 daemon. An example snippet of the hive-site.xml file enabling Kerberos
authentication is shown in Example 11-21.
Example 11-21. Configuration for Kerberos authentication with HiveServer2
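A representative snippet (the keytab path is illustrative):
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>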
JDBC clients connecting to a Kerberos-enabled HiveServer2 daemon need to have a valid Kerberos
TGT, and need to add the principal of the HiveServer2 daemon to their connection string. Example 11-22
shows how to create a connection string for Kerberos authentication.
Example 11-22. JDBC connection string for Kerberos authentication
// Start with the basic JDBC URL string
String url = "jdbc:hive2://hive.example.com:10000/default";
// Add the Hive Kerberos principal name, including the FQDN and realm
url = url + ";principal=hive/hive.example.com@EXAMPLE.COM";
// Create the connection from the URL
Connection con = DriverManager.getConnection(url);
The Beeline shell uses the Hive JDBC driver to connect to HiveServer2. You need to obtain your
Kerberos TGT using kinit and then connect using the same JDBC connection string shown earlier in order
to use Kerberos authentication with Beeline. See Example 11-23 for an example. Even though you’re
using Kerberos for authentication, Beeline will prompt for a username and password. You can leave these
blank and just hit Enter, as shown in the example.
Example 11-23. Beeline connection string for Kerberos authentication
[alice@hadoop01 ~]$ kinit
Password for alice@EXAMPLE.COM:
[alice@hadoop01 ~]$ beeline
Beeline version 0.13.1 by Apache Hive
beeline> !connect jdbc:hive2://hive.example.com:10000/default;principal=hive/hive.example.com@EXAMPLE.COM
scan complete in 2ms
Connecting to jdbc:hive2://hive.example.com:10000/default;principal=hive/hive.example.com@EXAMPLE.COM
Enter username for jdbc:hive2://hive.example.com:10000/default;principal=hive/hive.example.com@EXAMPLE.COM:
Enter password for jdbc:hive2://hive.example.com:10000/default;principal=hive/hive.example.com@EXAMPLE.COM:
Connected to: Apache Hive (version 0.13.1)
Driver: Hive JDBC (version 0.13.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hive.example.com> show tables;
+------------+--+
| tab_name |
+------------+--+
| sample_07 |
| sample_08 |
+------------+--+
2 rows selected (0.261 seconds)
0: jdbc:hive2://hive.example.com>
Using HiveServer2 with LDAP/Active Directory authentication
HiveServer2 also supports username/password authentication backed by LDAP. To use LDAP-based
authentication, set the authentication type to LDAP, configure the LDAP URL, and then either set a domain
name or base distinguished name for binding. The domain name is used when you’re binding against an
Active Directory server while the base DN is used for other LDAP providers such as OpenLDAP or
freeIPA.
WARNING
By default, the connection between clients and HiveServer2 is not encrypted. This means that when you're using either LDAP
or a custom authentication provider, the username and password could be intercepted by a third party. When using a non-
Kerberos authentication provider, it's strongly recommended to enable HiveServer2 over-the-wire encryption using TLS, as
shown in “HiveServer2 over-the-wire encryption”.
Regardless of the LDAP provider, it's strongly recommended that you use LDAPS (LDAP over SSL) rather
than straight LDAP. This will ensure that communication between HiveServer2 and the LDAP server is
encrypted. In order to use LDAPS, you need to make sure that the LDAP server certificate or the CA signing
certificate is loaded into a Java truststore. This can either be the system-wide Java truststore located at
$JAVA_HOME/jre/lib/security/cacerts or a specific truststore for use with Hive. If using a specific
truststore, you need to set the javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword system
properties. This can be done by setting the HADOOP_OPTS variable in the hive-env.sh file similar to
Example 11-24.
Example 11-24. Setting the LDAPS truststore for Hive
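A sketch, assuming a Hive-specific truststore at an illustrative path:
export HADOOP_OPTS="${HADOOP_OPTS} \
-Djavax.net.ssl.trustStore=/etc/hive/ssl/truststore.jks \
-Djavax.net.ssl.trustStorePassword=secret"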
WARNING
The truststore password used here will be visible to anyone that can perform a process listing on the server running HiveServer2.
You must protect the truststore file itself with strong permissions to prevent users from modifying the truststore.
If you're configuring HiveServer2 to authenticate against an Active Directory server, then you need to set
the hive.server2.authentication.ldap.Domain setting in hive-site.xml to your AD domain name in
addition to the common LDAP settings. See Example 11-25 for an example configuration.
Example 11-25. Configuration for Active Directory authentication with HiveServer2
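A representative snippet (the server URL and domain are illustrative):
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldaps://ad.example.com</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.Domain</name>
  <value>example.com</value>
</property>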
If you’re using another LDAP provider, such as OpenLDAP or freeIPA, then you need to set the
![]()
![]()
![]()
![]()
![]()
![]()
![]()
property rather than the domain name. The base DN
will depend on your environment, but the default for common OpenLDAP installations is
where
will be replaced with your LDAP
server’s domain components. Typically, this is the domain name of the LDAP server. For freeIPA, the
default base DN will be
. Again, substitute in the domain
components for your environment. A complete configuration example is provided in Example 11-26.
Example 11-26. Configuration for LDAP authentication with HiveServer2
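A representative snippet (the server URL and base DN are illustrative):
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldaps://ldap.example.com</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=People,dc=example,dc=com</value>
</property>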
WARNING
Some versions of Hive (notably Hive 0.13.0 and 0.13.1) have a bug where they won’t use Kerberos authentication to
communicate with Hadoop when the authentication type is set to something other than KERBEROS. When using these versions
of Hive, you should only use Kerberos for authentication.
Connecting to HiveServer2 when configured for LDAP/Active Directory authentication is easily handled
by passing the username and password to the DriverManager when getting a connection. See Example 11-
27 for an example.
Example 11-27. JDBC connection string for LDAP/Active Directory authentication
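Mirroring the Impala version in Example 11-18, with illustrative credentials:
// Use the basic JDBC URL string
String url = "jdbc:hive2://hive.example.com:10000/default";
// Create the connection from the URL, passing in the username and password
Connection con = DriverManager.getConnection(url, "alice", "secret");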
Connecting with Beeline is much the same way. This time, you'll enter the username and password when
prompted after issuing the !connect command. See Example 11-28 for an example.
Example 11-28. Beeline connection string for LDAP/Active Directory authentication
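A representative session:
[alice@hadoop01 ~]$ beeline
beeline> !connect jdbc:hive2://hive.example.com:10000/default
Enter username for jdbc:hive2://hive.example.com:10000/default: alice
Enter password for jdbc:hive2://hive.example.com:10000/default: ******
Connected to: Apache Hive (version 0.13.1)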
Using HiveServer2 with pluggable authentication
Hive has a pluggable interface for implementing new authentication providers. Hive calls this
authentication mode CUSTOM, and it requires a Java class that implements the
org.apache.hive.service.auth.PasswdAuthenticationProvider interface. This interface defines
an Authenticate() method that you implement to verify the supplied
username and password. As its name suggests, this pluggable interface only works with authentication
providers that use usernames and passwords for authentication. You configure this mode by setting the
authentication type to CUSTOM and setting the authentication class. You also have to add the JAR file with
your class in it to Hive's classpath. The easiest way is to add the path to your JAR to the
hive.aux.jars.path setting in hive-site.xml. This property takes a comma-delimited list of full paths to
JARs. See Example 11-29 for an example configuration.
Example 11-29. Configuration for pluggable authentication with HiveServer2

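A representative snippet; the provider class and JAR path are illustrative:
<property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
</property>
<property>
  <name>hive.server2.custom.authentication.class</name>
  <value>com.example.auth.ExamplePasswdAuthenticationProvider</value>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/example/example-auth.jar</value>
</property>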
The connection settings for custom authentication are the same as for LDAP/Active Directory-based
authentication. See Examples 11-18 and 11-28 for an illustration of this.
HiveServer2 over-the-wire encryption
The Hive JDBC driver supports two methods of enabling encryption over the wire. The method you use
will depend on the method of authentication you're using and your version of Hive. When you use
Kerberos authentication, the Hive JDBC driver uses SASL to perform the Kerberos authentication. SASL
supports integrity checks and encryption when doing authentication based on a configuration setting called
the quality of protection. To enable encryption, set the SASL QOP to auth-conf, which is short for
authentication with confidentiality. See Example 11-30 to see how to configure HiveServer2 to use SASL
for encryption.
Example 11-30. Configuring HiveServer2 to use SASL encryption
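A minimal hive-site.xml snippet:
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth-conf</value>
</property>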
When the SASL QOP is enabled on the server side, you need to make sure the client sets it to the same
value. This can be done by adding the saslQop=auth-conf option to the JDBC URL. Example 11-31
shows how to use SASL encryption with Beeline.
Example 11-31. Beeline connection string with SASL encryption
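For example:
[alice@hadoop01 ~]$ beeline -u "jdbc:hive2://hive.example.com:10000/default;principal=hive/hive.example.com@EXAMPLE.COM;saslQop=auth-conf"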
If you’ve configured Hive to use username/password-based authentication, such as LDAP/Active
Directory, then Hive will no longer use SASL to secure the connection. That means an alternative is
needed to enable encryption. Starting with Hive 0.13, you can configure Hive to
use TLS/SSL for encryption. Before you can configure Hive to use TLS, you need the private key and
certificate for your server in a Java keystore file. Assuming that you already have your private key and
certificate in a PKCS12 file, you can import them into a Java keystore following the process shown in
Example 11-32. Hive requires that the private key's password be set to the same as the keystore's
password. We handle that in the example by setting both -deststorepass and -destkeypass on the
command line. In addition, we provided the -srcalias parameter for the key/certificate we're importing.
A NOTE ON HIVE VERSIONS
Setting the SASL QOP property is only available in Hive 0.12.0 or later, and support for TLS encryption requires Hive 0.13.0 or
later.
Example 11-32. Importing a PKCS12 private key into a Java keystore
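A sketch modeled on the Oozie import shown earlier; hostnames and paths are illustrative:
[root@hive01 ~]# keytool -v -importkeystore \
-srckeystore /etc/pki/tls/private/hive01.example.com.p12 \
-srcstoretype PKCS12 \
-destkeystore /etc/hive/ssl/hive01.example.com.keystore -deststoretype JKS \
-deststorepass secret -srcalias hive01.example.com -destkeypass secret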
After creating our Java keystore, we’re ready to configure Hive to use it. Set the configuration properties
shown in Example 11-33 in the hive-site.xml file for HiveServer2.
Example 11-33. Configuring HiveServer2 to use TLS for encryption
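A representative snippet (the keystore path and password are illustrative):
<property>
  <name>hive.server2.use.SSL</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.keystore.path</name>
  <value>/etc/hive/ssl/hive01.example.com.keystore</value>
</property>
<property>
  <name>hive.server2.keystore.password</name>
  <value>secret</value>
</property>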
NOTE
TLS cannot be configured when using Kerberos for authentication. If you’re using Kerberos for authentication, then use SASL
QOP for encryption and use TLS otherwise.
Finally, we need to enable TLS on the client side by adding ssl=true to the JDBC URL. If your
certificate is not signed by a central certificate authority, then you also need to specify a truststore in the
JDBC URL. When we configured Hive to use LDAPS, we created a truststore that we can reuse here by
copying the truststore file to the client server and setting the sslTrustStore and trustStorePassword
parameters in the JDBC URL. See Example 11-34 for a full example of using TLS for encryption with
Beeline.
Example 11-34. Beeline connection string with TLS
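For example, reusing the truststore created for LDAPS:
[alice@hadoop01 ~]$ beeline -u "jdbc:hive2://hive.example.com:10000/default;ssl=true;sslTrustStore=/etc/hive/ssl/truststore.jks;trustStorePassword=secret"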
WebHDFS/HttpFS
Hadoop has two methods of exposing a REST interface to HDFS: WebHDFS and HttpFS. Both systems
use the same API so the same client can work with either; the difference is in how they’re deployed and
where the access to data lives. WebHDFS isn’t actually a separate service and runs inside the NameNode
and DataNodes. Because WebHDFS runs on the NameNode and DataNodes, it’s not suitable for users that
don’t have direct access to the cluster. In practice, WebHDFS is most commonly used to provide version-
independent access for bulk access utilities such as DistCp, the distributed copy command. See
Example 5-10 in Chapter 5 for the example configuration for WebHDFS.
In contrast, HttpFS runs as a gateway service similar to the HBase REST gateway. The first step of
configuring HttpFS with authentication is to configure HttpFS to use Kerberos to authenticate against
HDFS:
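A representative httpfs-site.xml snippet (the hostname and keytab path are illustrative):
<property>
  <name>httpfs.hadoop.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>httpfs.hadoop.authentication.kerberos.principal</name>
  <value>httpfs/httpfs01.example.com@EXAMPLE.COM</value>
</property>
<property>
  <name>httpfs.hadoop.authentication.kerberos.keytab</name>
  <value>/etc/hadoop-httpfs/conf/httpfs.keytab</value>
</property>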
Next, we set the authentication method that the HttpFS server will use to authenticate clients. Again we
use Kerberos, which will be implemented over SPNEGO:
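Again a representative snippet; SPNEGO uses the HTTP service principal:
<property>
  <name>httpfs.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>httpfs.authentication.kerberos.principal</name>
  <value>HTTP/httpfs01.example.com@EXAMPLE.COM</value>
</property>
<property>
  <name>httpfs.authentication.kerberos.keytab</name>
  <value>/etc/hadoop-httpfs/conf/httpfs.keytab</value>
</property>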
Lastly, we need to configure HttpFS to allow the Hue user to impersonate other users. This is done with
the typical proxy user settings—for example, the following settings will allow the hue user to
impersonate users from any host and in any group:
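Using the HttpFS proxy user properties:
<property>
  <name>httpfs.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>httpfs.proxyuser.hue.groups</name>
  <value>*</value>
</property>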
Summary
In this chapter, we took a deep dive into how clients access a Hadoop cluster to take advantage of the
many services it provides and the data it stores. What is immediately obvious is that securing this access
is a daunting task because of the myriad of access points available to clients. A key theme throughout,
however, is that clients must obey the established authentication and authorization methods, such as those
provided by Kerberos and LDAP.
We also spent some time on how users get data out of the cluster with Sqoop, Hive, Impala, WebHDFS,
and HttpFS. While the Hadoop ecosystem itself has grown over the years, so too has the wide ecosystem
of business intelligence, ETL, and other related tools that interact with Hadoop. For this reason, having a
solid grasp on data extraction capabilities of the platform and the modes to secure them is critical for an
administrator to understand.
1 Refer back to Table 4-1 for a refresher on TGTs.
Chapter 12. Cloudera Hue
Hue is a web application that provides an end-user focused interface for a large
number of the projects in the Hadoop ecosystem. When Hadoop is configured
with Kerberos authentication, Hue must be configured with Kerberos
credentials to properly access Hadoop. Kerberos is enabled by setting the
following parameters in the hue.ini file:
hue_principal
The Kerberos principal name for the Hue service, including the fully qualified
domain name of the Hue server
hue_keytab
The path to the Kerberos keytab file containing Hue's service credentials
kinit_path
The path to the Kerberos kinit command (not needed if kinit is on the
path)
reinit_frequency
The frequency in seconds for Hue to renew its Kerberos tickets
These settings should be placed under the [[kerberos]] subsection of the
[desktop] top-level section in the hue.ini file. See Example 12-1 for a sample
Hue Kerberos configuration.
Example 12-1. Configuring Kerberos in Hue
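The original listing was lost in extraction; a minimal hue.ini sketch, with the hostname, realm, and paths as assumed placeholder values:

[desktop]
  [[kerberos]]
    hue_principal=hue/hue.example.com@EXAMPLE.COM
    hue_keytab=/etc/hue/conf/hue.keytab
    kinit_path=/usr/bin/kinit
    reinit_frequency=3600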
Hue has its own set of authentication backends and authenticates against Hadoop
and other projects using Kerberos. In order to perform actions on behalf of other
users, Hadoop must be configured to trust the Hue service. This is done by
configuring Hadoop's proxy user/user impersonation capabilities. This is
controlled by setting the hosts Hue can run on and the groups of users that Hue
can impersonate. Either value can be set to * to indicate that impersonation is
enabled from all hosts or from all groups, respectively. Example 12-2 shows
how to enable Hue to impersonate users when accessing Hadoop from the host
hue.example.com and for users in the hadoop-users group.
Example 12-2. Configuring Hue user impersonation for Hadoop in core-site.xml
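A sketch of the corresponding core-site.xml properties:

<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>hue.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>hadoop-users</value>
</property>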
HBase and Hive use the Hadoop impersonation configuration, but Oozie must be
configured independently. If you want to use Oozie from Hue, you must set the
oozie.service.ProxyUserService.proxyuser.hue.hosts and
oozie.service.ProxyUserService.proxyuser.hue.groups properties
in the oozie-site.xml file. Example 12-3 shows how to enable Hue to
impersonate users when accessing Oozie from the host hue.example.com and
for users in the hadoop-users group.
Example 12-3. Configuring Hue user impersonation for Oozie in oozie-site.xml
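A sketch of the corresponding oozie-site.xml properties:

<property>
  <name>oozie.service.ProxyUserService.proxyuser.hue.hosts</name>
  <value>hue.example.com</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
  <value>hadoop-users</value>
</property>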
If you're using the Hue search application, you also need to enable impersonation
in Solr. This is done by setting the SOLR_SECURITY_ALLOWED_PROXYUSERS,
SOLR_SECURITY_PROXYUSER_hue_HOSTS, and
SOLR_SECURITY_PROXYUSER_hue_GROUPS
environment variables in the
/etc/default/solr file. See Example 12-4 for a sample configuration to enable
impersonation from the host hue.example.com and for users in the hadoop-users group.
Example 12-4. Configuring Hue user impersonation for Solr in
/etc/default/solr
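A sketch of the corresponding /etc/default/solr entries (the variable names here are Cloudera Search conventions; the host and group values are placeholders):

SOLR_SECURITY_ALLOWED_PROXYUSERS=hue
SOLR_SECURITY_PROXYUSER_hue_HOSTS=hue.example.com
SOLR_SECURITY_PROXYUSER_hue_GROUPS=hadoop-users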
Hue HTTPS
By default, Hue runs over plain old HTTP. This is suitable for proofs of concept
or for environments where the network between clients and Hue is fully trusted.
However, for most environments it’s strongly recommended that you configure
Hue to use HTTPS. This is especially important if you don’t fully trust the
network between clients and Hue, as most of Hue’s authentication backends
support entering in a username and password through a browser form.
Fortunately, Hue makes configuring HTTPS easy. To do so, you simply configure
the ssl_certificate and ssl_private_key settings, which are both under the
[desktop] section of the hue.ini file. Both files should be in PEM format and the
private key cannot be encrypted with a passphrase. See Example 12-5 for a
sample configuration.
Example 12-5. Configuring Hue to use HTTPS
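A minimal hue.ini sketch, with assumed certificate and key paths:

[desktop]
  ssl_certificate=/etc/hue/conf/hue-cert.pem
  ssl_private_key=/etc/hue/conf/hue-key.pem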
WARNING
Hue does not currently support using a private key that is protected with a passphrase. This
means it’s very important that Hue’s private key be protected to the greatest extent possible.
Ensure that the key is owned by the hue user and is only readable by its owner (e.g.,
mode 0400). You might also configure filesystem-level encryption on the
filesystem storing the private key, as described in "Filesystem Encryption". In cases where Hue
is on a server that has other resources protected by TLS/SSL, it’s strongly recommended that
you issue a unique certificate just for Hue. This will lower the risk if Hue’s private key is
compromised by protecting other services running on the same machine.
Hue Authentication
Hue has a pluggable authentication framework and ships a number of useful
authentication backends. The default authentication backend uses a private list of
usernames and passwords stored in Hue's backing database. The backend is
configured by setting the backend property to
desktop.auth.backend.AllowFirstUserDjangoBackend under the [[auth]]
subsection of the [desktop] section. See Example 12-6 for a sample
hue.ini file where the backend is explicitly set. Because this is the default, you
can also leave this setting out entirely.
Example 12-6. Configuring the default Hue authentication backend
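The listing itself was lost in extraction; it amounts to:

[desktop]
  [[auth]]
    backend=desktop.auth.backend.AllowFirstUserDjangoBackend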
Hue also has support for using Kerberos/SPNEGO, LDAP, PAM, and SAML for
authentication. We won't cover all of the options here, so refer to the
config_help command for more information.1
SPNEGO Backend
Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO)2 is a
GSSAPI pseudo-mechanism for allowing clients and servers to negotiate the
choice of authentication technology. SPNEGO is used any time a client wants to
authenticate with a remote server but neither the client nor the server knows in
advance the authentication protocols the other supports. The most common use of
SPNEGO is with the HTTP negotiate protocol first proposed by Microsoft.3
Hue only supports SPNEGO with Kerberos V5 as the underlying mechanism. In
particular, this means you can't use Hue with the Microsoft NT LAN Manager
(NTLM) protocol. Configuring SPNEGO with Hue requires setting the Hue
authentication backend to SpnegoDjangoBackend (see Example 12-7), as well as
setting the KRB5_KTNAME environment variable to the location of a keytab file that
has the key for Hue's HTTP service principal. If you're starting Hue by hand and
your keytab is located in /etc/hue/conf/hue.keytab, then you'd start Hue as
shown in Example 12-8.
Example 12-7. Configuring the SPNEGO Hue authentication backend
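The listing itself was lost in extraction; it amounts to:

[desktop]
  [[auth]]
    backend=desktop.auth.backend.SpnegoDjangoBackend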
Example 12-8. Setting KRB5_KTNAME and starting Hue manually
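A representative invocation (the supervisor path assumes a package-based install; adjust for your layout):

[root@hue ~]# export KRB5_KTNAME=/etc/hue/conf/hue.keytab
[root@hue ~]# /usr/share/hue/build/env/bin/supervisor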
In order to use SPNEGO, you also need to have a TGT on your desktop (e.g., by
running kinit) and you need to use a browser that supports SPNEGO. Internet
Explorer and Safari both support SPNEGO without additional configuration. If
you're using Firefox, you first must add the server or domain name you're
authenticating against to the list of trusted URIs. This is done by typing
about:config in the URL bar, then searching for
network.negotiate-auth.trusted-uris, and then updating that preference to include the server
name or domain name. For example, if you wanted to support SPNEGO with any
server on the example.com domain, you would set
network.negotiate-auth.trusted-uris to .example.com. If you see an authorization failure message
while trying to connect to Hue, you likely don't have your trusted URIs
configured correctly in Firefox.
SAML Backend
Hue also supports using the Security Assertion Markup Language (SAML)
standard for single sign-on (SSO). SAML works by separating service
providers (SP) from identity providers (IdP). When you request access to a
resource, the SP will redirect you to the IdP where authentication will take
place. The IdP will then pass an assertion validating your identity to the SP who
will grant access to the target resource. The Wikipedia article on SAML has
more details, including a diagram showing the steps of the SAML process.
When configured with the SAML authentication backend, Hue will act as a
service provider and redirect to your identity provider for authentication.
Configuring Hue to use SAML is more complicated than the other authentication
backends. Hue has to interact with a third-party identity provider so some of the
details will depend on which identity provider you’re using. Also, Hue doesn’t
ship with several of the dependencies required to use SAML. So we’ll start by
installing the required dependencies by following the steps in Example 12-9.
Example 12-9. Install dependencies for SAML authentication backend
[root@hue ~]# yum install swig openssl openssl-devel gcc python-devel
Loaded plugins: fastestmirror, priorities
Loading mirror speeds from cached hostfile
base: mirror.hmc.edu
extras: mirrors.unifiedlayer.com
updates: mirror.pac-12.org
...
Complete!
[root@hue ~]# yum install xmlsec1 xmlsec1-openssl
Loaded plugins: fastestmirror, priorities
Loading mirror speeds from cached hostfile
base: mirror.hmc.edu
extras: mirrors.unifiedlayer.com
updates: mirror.pac-12.org
...
Complete!
[root@hue ~]# $HUE_HOME/build/env/bin/pip install --upgrade setuptools
Downloading/unpacking setuptools from https://pypi.python.org/packages/sourc
e/s/setuptools/setuptools-7.0.tar.gz#md5=6245d6752e2ef803c365f560f7f2f940
Downloading setuptools-7.0.tar.gz (793Kb): 793Kb downloaded
...
Successfully installed setuptools
Cleaning up...
[root@hue ~]# $HUE_HOME/build/env/bin/pip install -e \
git+https://github.com/abec/pysaml2@HEAD#egg=pysaml2
Obtaining pysaml2 from git+https://github.com/abec/pysaml2@HEAD#egg=pysaml2
Updating ./build/env/src/pysaml2 clone (to HEAD)
...
Successfully installed pysaml2 m2crypto importlib WebOb
Cleaning up...
[root@hue ~]# $HUE_HOME/build/env/bin/pip install -e \
git+https://github.com/abec/djangosaml2@HEAD#egg=djangosaml2
Obtaining djangosaml2 from git+https://github.com/abec/djangosaml2@HEAD#egg=
djangosaml2
Cloning https://github.com/abec/djangosaml2 (to HEAD) to ./build/env/src/d
jangosaml2
...
Successfully installed djangosaml2
Cleaning up...
This will install some development tools and then install the Python modules
required to work with SAML. After installing the dependencies, you need to
download the metadata file from your identity provider. The details will vary
depending on which identity provider you’re using. For the Shibboleth Identity
Provider, you can use a tool such as curl or wget to download the metadata to
/etc/hue/saml/metadata.xml.
You also need a certificate and private key to sign requests with. This has to be a
key trusted by your identity provider to sign requests, so you might not be able to
just reuse the same key and certificate you used when enabling HTTPS for Hue.
For our purposes, we’ll assume that the key and certificate have been created and
placed into the /etc/hue/saml/key.pem and /etc/hue/saml/idp.pem files,
respectively. All that’s left is to configure Hue itself. See Example 12-10 for the
relevant sections from the /etc/hue/conf/hue.ini file.
Example 12-10. Configuring the SAML Hue authentication backend
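A sketch of the relevant hue.ini sections, reusing the paths established earlier in this section:

[desktop]
  [[auth]]
    backend=libsaml.backend.SAML2Backend

[libsaml]
  xmlsec_binary=/usr/bin/xmlsec1
  metadata_file=/etc/hue/saml/metadata.xml
  key_file=/etc/hue/saml/key.pem
  cert_file=/etc/hue/saml/idp.pem
  create_users_on_login=true
  required_attributes=uid,email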
There are additional optional configuration parameters that can be set in the
[libsaml] configuration group. The full list of configuration parameters is shown here:
xmlsec_binary
Path to the xmlsec1 binary, which is the executable used to sign, verify,
encrypt, and decrypt SAML requests and assertions. Typically
/usr/bin/xmlsec1.
create_users_on_login
Create Hue users upon login. Can be true or false.
required_attributes
Attributes Hue demands from the IdP. Comma-separated list of attributes.
Example: uid,email
optional_attributes
Attributes Hue can handle from the IdP. Comma-separated list of attributes.
Handled the same way as required_attributes.
metadata_file
Path to the metadata XML file from the IdP. The file must be readable by the
hue user.
key_file
PEM-formatted key file.
cert_file
PEM-formatted X.509 certificate.
user_attribute_mapping
Maps attributes received from the IdP (specified in required_attributes,
optional_attributes, and the IdP config) to Hue user attributes. Example:
{'uid': 'username', 'email': 'email'}
authn_requests_signed
Sign authentication requests. Can be true or false. Check the
documentation of your IdP to see if this setting is required.
logout_requests_signed
Sign logout requests. Can be true or false. Check the documentation of
your IdP to see if this setting is required.
LDAP Backend
Hue can also authenticate users against an LDAP directory such as Active
Directory. Two modes are supported: search bind, where Hue searches the
directory for the user's distinguished name (DN) and then binds as that user, and
direct bind, which provides Hue with a DN pattern which is filled in with a
username, and then a bind is performed without a search.
When configuring Hue with search bind, you must set
search_bind_authentication to true and set the ldap_url and base_dn settings in
the [[ldap]] subsection of the [desktop] section. You also must set the
user_filter and user_name_attr settings in the [[[users]]] subsection of
the [[ldap]] subsection of the [desktop] section. You should also set the
group_filter, group_name_attr, and group_member_attr settings in the
[[[groups]]] subsection of the [[ldap]] subsection of the [desktop] section
so that you can import LDAP groups as Hue groups.
A snippet of a hue.ini configuration file configured to do LDAP authentication
with a search bind is shown in Example 12-11. In this example, the LDAP server
is running with LDAPS, users and groups are stored under a common base DN,
and user accounts and groups live in separate organizational units. A complete
description of all of the LDAP-related settings is shown in Table 12-1.
Example 12-11. Configuring the LDAP Hue authentication backend with
search bind
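A sketch with assumed server, base DN, and filter values (substitute your directory's actual values):

[desktop]
  [[ldap]]
    ldap_url=ldaps://ldap.example.com
    ldap_cert=/etc/hue/conf/ca-chain.pem
    base_dn=dc=example,dc=com
    search_bind_authentication=true
    create_users_on_login=true
    [[[users]]]
      user_filter=objectclass=posixAccount
      user_name_attr=uid
    [[[groups]]]
      group_filter=objectclass=posixGroup
      group_name_attr=cn
      group_member_attr=member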
If you prefer to use a direct bind, then you must set
search_bind_authentication to false and set either nt_domain or
ldap_username_pattern depending on whether you're using Active Directory
or another LDAP provider, respectively. You must still configure the search-
related settings (e.g., base_dn, user_filter, etc.) as those will be
used when syncing users and groups from LDAP. If we want to use the same
server setup as before but use direct bind instead of search bind, we would use a
configuration similar to Example 12-12. Again, the full set of LDAP
configuration parameters is shown in Table 12-1.
Example 12-12. Configuring the LDAP Hue authentication backend with direct
bind
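The same assumed server setup as Example 12-11, switched to direct bind (the DN pattern is a placeholder):

[desktop]
  [[ldap]]
    ldap_url=ldaps://ldap.example.com
    ldap_cert=/etc/hue/conf/ca-chain.pem
    base_dn=dc=example,dc=com
    search_bind_authentication=false
    ldap_username_pattern=uid=<username>,ou=people,dc=example,dc=com
    create_users_on_login=true
    [[[users]]]
      user_filter=objectclass=posixAccount
      user_name_attr=uid
    [[[groups]]]
      group_filter=objectclass=posixGroup
      group_name_attr=cn
      group_member_attr=member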
Table 12-1. Configuration properties for Hue LDAP authentication
Section | Property | Description
desktop.auth | backend | The authentication backend to use (set to desktop.auth.backend.LdapBackend)
desktop.ldap | ldap_url | The LDAP server URL (use ldaps:// for secure LDAP)
desktop.ldap | base_dn | The base LDAP distinguished name to use when searching the directory
desktop.ldap | bind_dn | The distinguished name to bind as when searching for users and groups
desktop.ldap | bind_password | The password for the bind_dn user; only needed when a bind_dn is set
desktop.ldap | create_users_on_login | Set to true to create users the first time they log in
desktop.ldap | search_bind_authentication | Set to true to use search bind; false to use direct bind
desktop.ldap | ldap_username_pattern | The pattern used to construct distinguished names; contains the string <username>, which will be replaced with the login name (only used when search_bind_authentication=false)
desktop.ldap | nt_domain | The NT domain of the Active Directory server (only used when search_bind_authentication=false)
desktop.ldap | ldap_cert | The location of the CA certificate used to validate the LDAP server's certificate
desktop.ldap | use_start_tls | Set to true to use StartTLS; set to false when using an ldaps:// URL for ldap_url
desktop.ldap.users | user_filter | The base filter to use when searching for users
desktop.ldap.users | user_name_attr | The username attribute in the LDAP schema (this is typically sAMAccountName for Active Directory and uid for other LDAP directories)
desktop.ldap.groups | group_filter | The base filter to use when searching for groups
desktop.ldap.groups | group_name_attr | The group name attribute in the LDAP schema (this is typically cn)
desktop.ldap.groups | group_member_attr | The LDAP attribute that specifies members of a group (this is typically member)
One thing you’ll notice in Examples 12-11 and 12-12 is that we set ldap_cert to
point to a CA certificate. This is needed because we configured the LDAP URL
using the ldaps:// scheme. It’s strongly recommended that you use either LDAPS
or StartTLS. When using StartTLS, configure the LDAP URL with the ldap://
scheme and set use_start_tls to true.
Hue Authorization
Hue superusers can perform the following actions:
Add and delete users
Add and delete groups
Assign permissions to groups
Change a user into a superuser
Import users and groups from an LDAP server
Install the example queries, tables, and data
View, submit, and modify any Oozie workflow, coordinator, or bundle
Impersonate any user when viewing and modifying Sentry permissions
Impersonate any user when viewing and modifying HDFS ACLs
NOTE
The Hue superuser is not the same as the HDFS superuser. The HDFS superuser is the user
that runs the NameNode daemon, typically hdfs, and has permission to list, read, and write any
HDFS files and directories. If you want to perform HDFS superuser actions from Hue, you
need to add a user with the same username as the HDFS superuser. Alternatively, you can set
the HDFS super group to assign a group of users HDFS superuser privileges. See “HDFS
Authorization”.
Each Hue application defines one or more actions that users can perform.
Authorization is controlled by setting an ACL per action that lists the groups that
can perform that action. Every application has an action called “Launch this
application” which controls which users can run that application. Several
applications define additional actions that can be controlled.
NOTE
The permissions granted in Hue only grant privileges to invoke the given action from the given
Hue app. The user performing an action will still need to be authorized by the service they’re
accessing. For example, a user might have permission for the “Allow DDL operations” action in the
Metastore app, but if she doesn't have the ALL privilege on the database in Sentry, she won't be
able to create tables.
The HBase app defines the “Allow writing in the HBase app” action, which
gives permissions to add rows, add cells, edit cells, drop rows, and drop cells
from the HBase app. The Metastore app defines the “Allow DDL operations”
action, which gives permission to create, edit, and drop tables from the metastore
browser. The Oozie app defines the “Oozie Dashboard read-only user for all
jobs,” which grants permission to have read-only access to all workflows,
coordinators, and bundles, regardless of whether they’re shared. The Security
app defines the “Let a user impersonate another user when listing objects like
files or tables” action. This action lets the user impersonate other users and see
what tables, files, and directories that user has access to.
WARNING
Granting permission for the “Let a user impersonate another user when listing objects like files
or tables” can expose information that would otherwise not be available. In particular, it allows a
user to impersonate a user that has access to see files in a directory for which the logged-in
user is not authorized. Permissions to perform this action should be granted sparingly. It’s also
worth warning that Hue superusers also have access to impersonate users in the Security app.
Thus care should also be taken in making a user a Hue superuser.
The Useradmin app defines the “Access to profile page on User Admin” action,
but this action is deprecated and can be safely ignored.
Hue SSL Client Configurations
In this Hue section, we have covered a lot of pertinent security configurations,
but certainly have not exhaustively covered how to set up and configure Hue in
the general case, which we deem out of scope for this book. However, one
important aspect to explain is how to properly set up Hue when the various
underlying components are set up with SSL wire encryption. If Hadoop, Hive,
and Impala have SSL enabled, Example 12-13 shows the snippets that are
necessary.
Example 12-13. Hue SSL client configurations
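A sketch of the relevant hue.ini snippets, with assumed hostnames and a placeholder CA bundle path:

[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      webhdfs_url=https://httpfs.example.com:14000/webhdfs/v1

[beeswax]
  hive_server_host=hs2.example.com
  [[ssl]]
    enabled=true
    cacerts=/etc/hue/conf/cacerts.pem
    validate=true

[impala]
  server_host=impalad.example.com
  [[ssl]]
    enabled=true
    cacerts=/etc/hue/conf/cacerts.pem
    validate=true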
In both the Hive and Impala configurations, the validate option specifies
whether Hue should check that the certificates presented by those services are
signed by an authority in the configured certificate authority chain.
In addition to what is shown in the example, an environment variable
REQUESTS_CA_BUNDLE needs to point to the location on disk where the SSL
certificate authority chain file is (in PEM format). This is used for the Hadoop
SSL client to HDFS, MapReduce, YARN, and HttpFS.
Summary
In this chapter, we took a close look at Hue's important role in allowing end
users to access several different ecosystem components through a centralized
web console. With Hue, users are able to log in once to a web console, with the
rest of their cluster actions performed via impersonation from a Hue system user.
1 You can execute the config_help command with either
/usr/share/hue/build/env/bin/hue config_help or
/opt/cloudera/parcels/CDH/lib/hue/build/env/bin/hue config_help
depending on how Hue was installed.
2 See RFC 4178 for a description of the SPNEGO pseudo-mechanism.
3 See Microsoft’s MSDN article for details.
Part IV. Putting It All Together
In this chapter, we present two case studies that cover many of the security
topics in the book. First, we’ll take a look at how Sentry can be used to
control SQL access to data in a multitenancy environment. This will serve as
a good warmup before we dive into a more detailed case study that shows a
custom HBase application in action with various security features in place.
Case Study: Hadoop Data Warehouse
First, let’s list the assumptions we are making for this case study:
The environment consists of three lines of business, which we will call
lob1, lob2, and lob3
Each line of business has analysts and administrators
The analysts are defined by the groups lob1grp, lob2grp, and
lob3grp
The administrators are defined by the groups lob1adm, lob2adm, and
lob3adm
Administrators are also in the analysts groups
Each line of business needs to have its own sandbox area in HDFS to do
ad hoc analysis, as well as to upload self-service data sources
Each line of business has its own administrators that control access to
their respective sandboxes
Data inside the Hive warehouse is IT-managed, meaning only
noninteractive ETL users add data
Only Hive administrators create new objects in the Hive warehouse
The Hive warehouse uses the default HDFS location
/user/hive/warehouse
Kerberos has already been set up for the cluster
Sentry has already been set up in the environment
HDFS already has extended ACLs enabled
The default umask for HDFS is set to 007
Now that we have the basic assumptions, we need to set up the necessary
directories in HDFS and prepare them for Sentry. The first thing we will do
is lock down the Hive warehouse directory. HiveServer2 impersonation is
disabled when enabling Sentry, so only the hive group should have access
(which includes the hive and impala users). Here’s what we need to do:
[root@server1 ~]# kinit hive
Password for hive@EXAMPLE.COM:
[root@server1 ~]# hdfs dfs -chmod -R 0771 /user/hive/warehouse
[root@server1 ~]# hdfs dfs -chown -R hive:hive /user/hive/warehouse
[root@server1 ~]#
As mentioned in the assumptions, each line of business needs a sandbox area.
We will create the path /data/sandbox as the root directory for all the
sandboxes, and create the associated structures within it:
[root@server1 ~]# kinit hdfs
Password for hdfs@EXAMPLE.COM:
[root@server1 ~]# hdfs dfs -mkdir /data
[root@server1 ~]# hdfs dfs -mkdir /data/sandbox
[root@server1 ~]# hdfs dfs -mkdir /data/sandbox/lob1
[root@server1 ~]# hdfs dfs -mkdir /data/sandbox/lob2
[root@server1 ~]# hdfs dfs -mkdir /data/sandbox/lob3
[root@server1 ~]# hdfs dfs -chmod 770 /data/sandbox/lob1
[root@server1 ~]# hdfs dfs -chmod 770 /data/sandbox/lob2
[root@server1 ~]# hdfs dfs -chmod 770 /data/sandbox/lob3
[root@server1 ~]# hdfs dfs -chgrp lob1grp /data/sandbox/lob1
[root@server1 ~]# hdfs dfs -chgrp lob2grp /data/sandbox/lob2
[root@server1 ~]# hdfs dfs -chgrp lob3grp /data/sandbox/lob3
[root@server1 ~]#
Now that the basic directory structure is set up, we need to start thinking
about what is needed to support Hive and Impala access to the sandbox.
After all, these sandboxes are the place where all the users will be doing
their ad hoc analytic work. Both the hive and impala users need access to
these directories, so let’s go ahead and set up HDFS-extended ACLs to
allow the hive group full access:
[root@server1 ~]# hdfs dfs -setfacl -m default:group:hive:rwx \
/data/sandbox/lob1
[root@server1 ~]# hdfs dfs -setfacl -m default:group:hive:rwx \
/data/sandbox/lob2
[root@server1 ~]# hdfs dfs -setfacl -m default:group:hive:rwx \
/data/sandbox/lob3
[root@server1 ~]# hdfs dfs -setfacl -m group:hive:rwx /data/sandbox/lob1
[root@server1 ~]# hdfs dfs -setfacl -m group:hive:rwx /data/sandbox/lob2
[root@server1 ~]# hdfs dfs -setfacl -m group:hive:rwx /data/sandbox/lob3
[root@server1 ~]#
WARNING
Remember, the default ACL is only applicable to directories, and it only dictates the ACLs
that are copied to new subdirectories and files. Because of this fact, the parent directories
still need a regular access ACL.
The next part we need to do is to make sure that regardless of who creates
new files, all the intended accesses persist. If we left the permissions as they
are right now, new directories and files created by the hive or impala users
may actually be accessible by the analysts and administrators in the line of
business. To fix that, let’s go ahead and add those groups to the extended
ACLs:
[root@server1 ~]# hdfs dfs -setfacl -m default:group:lob1grp:rwx \
/data/sandbox/lob1
[root@server1 ~]# hdfs dfs -setfacl -m default:group:lob1adm:rwx \
/data/sandbox/lob1
[root@server1 ~]# hdfs dfs -setfacl -m default:group:lob2grp:rwx \
/data/sandbox/lob2
[root@server1 ~]# hdfs dfs -setfacl -m default:group:lob2adm:rwx \
/data/sandbox/lob2
[root@server1 ~]# hdfs dfs -setfacl -m default:group:lob3grp:rwx \
/data/sandbox/lob3
[root@server1 ~]# hdfs dfs -setfacl -m default:group:lob3adm:rwx \
/data/sandbox/lob3
[root@server1 ~]# hdfs dfs -setfacl -m group:lob1grp:rwx \
/data/sandbox/lob1
[root@server1 ~]# hdfs dfs -setfacl -m group:lob1adm:rwx \
/data/sandbox/lob1
[root@server1 ~]# hdfs dfs -setfacl -m group:lob2grp:rwx \
/data/sandbox/lob2
[root@server1 ~]# hdfs dfs -setfacl -m group:lob2adm:rwx \
/data/sandbox/lob2
[root@server1 ~]# hdfs dfs -setfacl -m group:lob3grp:rwx \
/data/sandbox/lob3
[root@server1 ~]# hdfs dfs -setfacl -m group:lob3adm:rwx \
/data/sandbox/lob3
[root@server1 ~]#
Now that we have all the extended ACLs set up, let’s take a look at one of
them:
[root@server1 ~]# hdfs dfs -getfacl -R /data/sandbox/lob1
# file: /data/sandbox/lob1
# owner: hdfs
# group: lob1grp
user::rwx
group::rwx
group:hive:rwx
group:lob1adm:rwx
group:lob1grp:rwx
mask::rwx
other::---
default:user::rwx
default:group::rwx
default:group:hive:rwx
default:group:lob1adm:rwx
default:group:lob1grp:rwx
default:mask::rwx
default:other::---
[root@server1 ~]#
We have handled all of the tenants in the cluster, so let’s make sure we also
create a space in HDFS for the ETL noninteractive user to use:
[root@server1 ~]# hdfs dfs -mkdir /data/etl
[root@server1 ~]# hdfs dfs -chown etluser:hive /data/etl
[root@server1 ~]# hdfs dfs -chmod 770 /data/etl
[root@server1 ~]# hdfs dfs -setfacl -m default:group:hive:rwx /data/etl
[root@server1 ~]# hdfs dfs -setfacl -m group:hive:rwx /data/etl
[root@server1 ~]# hdfs dfs -setfacl -m default:user:etluser:rwx /data/etl
[root@server1 ~]# hdfs dfs -setfacl -m user:etluser:rwx /data/etl
[root@server1 ~]# hdfs dfs -getfacl /data/etl
# file: /data/etl
# owner: etluser
# group: hive
user::rwx
user:etluser:rwx
group::rwx
group:hive:rwx
mask::rwx
other::---
default:user::rwx
default:user:etluser:rwx
default:group::rwx
default:group:hive:rwx
default:mask::rwx
default:other::---
[root@server1 ~]#
The next step is to start doing some administration tasks in Hive using the
beeline shell. We will use the hive user, because by default it is a Sentry
administrator, and can thus create policies.
TIP
You can use a properties file for Beeline to specify connection information. This makes it
much easier than remembering the syntax or looking at your bash history.
The beeline.properties file we will use is shown in Example 13-1. Note that
the username and password are required but unused for the actual
authentication because Kerberos is enabled.
Example 13-1. beeline.properties file
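The original listing was lost in extraction; a sketch of such a file, assuming SQLLine-style property names (url, driver, user, password) and placeholder host and realm values:

url=jdbc:hive2://hs2.example.com:10000/default;principal=hive/hs2.example.com@EXAMPLE.COM
driver=org.apache.hive.jdbc.HiveDriver
user=hive
password=unused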
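The original beeline session was also lost in extraction; a representative sequence of statements for creating the administrator role and the per-tenant databases (the role and server names are assumptions, with server1 a common default for the Sentry server name) is:

0: jdbc:hive2://...> CREATE ROLE admin;
0: jdbc:hive2://...> GRANT ALL ON SERVER server1 TO ROLE admin;
0: jdbc:hive2://...> GRANT ROLE admin TO GROUP hive;
0: jdbc:hive2://...> CREATE DATABASE lob1;
0: jdbc:hive2://...> CREATE DATABASE lob2;
0: jdbc:hive2://...> CREATE DATABASE lob3;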
Now that we have the administrator role and databases created, we can set
up the Sentry policies that will provide authorization for both Hive and
Impala to end users:
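A representative set of grants (role names assumed; the lob2 and lob3 statements follow the same pattern):

0: jdbc:hive2://...> CREATE ROLE lob1;
0: jdbc:hive2://...> GRANT ALL ON DATABASE lob1 TO ROLE lob1;
0: jdbc:hive2://...> GRANT ROLE lob1 TO GROUP lob1grp;
0: jdbc:hive2://...> GRANT ROLE lob1 TO GROUP lob1adm;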
Another important requirement we listed in the assumptions is that users
should be able to upload self-service files to their respective sandboxes. To
allow users to leverage these files in Hive and Impala, they also need some
URI privileges. We will also go ahead and provide write privileges so that
users can extract data out of Hive and into the sandbox area for
additional non-SQL analysis:
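Representative URI grants (nameservice1 is an assumed HA nameservice name; see the following note):

0: jdbc:hive2://...> GRANT ALL ON URI 'hdfs://nameservice1/data/sandbox/lob1' TO ROLE lob1;
0: jdbc:hive2://...> GRANT ALL ON URI 'hdfs://nameservice1/data/sandbox/lob2' TO ROLE lob2;
0: jdbc:hive2://...> GRANT ALL ON URI 'hdfs://nameservice1/data/sandbox/lob3' TO ROLE lob3;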
NOTE
The URI paths shown use the HDFS HA nameservice name. If you do not have HA set
up, you will need to specify the NameNode fully qualified domain name explicitly, including
the port (8020).
User Experience
With the environment fully up, ready, and outfitted with our full set of HDFS
privileges and Sentry policies, let's look at what end users see with these
enforcements in place. First, we will look at what a user in the admin
role sees:
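The original output was lost in extraction; for the admin user it would resemble the following (database names taken from the setup above):

0: jdbc:hive2://...> SHOW DATABASES;
+----------------+
| database_name  |
+----------------+
| default        |
| lob1           |
| lob2           |
| lob3           |
+----------------+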
As you can see, the admin role is allowed to see every database that we
set up. This is expected because the role has been granted full
access to the server object. Next, we will take a look at what a user
assigned the lob1 role sees:
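Again the original output was lost; for a lob1 analyst it would resemble the following, with a denial when an unauthorized object is requested by name:

0: jdbc:hive2://...> SHOW DATABASES;
+----------------+
| database_name  |
+----------------+
| default        |
| lob1           |
+----------------+
0: jdbc:hive2://...> SHOW TABLES IN lob2;
Error: Error while compiling statement: FAILED: SemanticException No valid privileges (state=42000,code=40000)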
This time, the user does not see the full list of databases in the metastore.
Instead, the user sees only the databases that contain objects that they have
some access to. The example shows that not only are objects the user does
not have access to hidden from the user, but that they are denied access even
if the user requests the object by name. This is exactly what we expect to
happen.
Now let's say that a table in the lob1 database needs to be made
available to the lob2 role. However, the caveat is that not all of the
columns can be shared. For that, we need to create a view that contains only
the columns we intend to make visible to the role. After creating this view,
we grant access to it for the lob2 role:
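A representative sequence (the table, view, and column names are illustrative, not from the original example):

0: jdbc:hive2://...> USE lob1;
0: jdbc:hive2://...> CREATE VIEW lob1.shared_v AS SELECT col1, col2 FROM lob1.source_table;
0: jdbc:hive2://...> GRANT SELECT ON TABLE lob1.shared_v TO ROLE lob2;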
After completing these tasks, we can test access with a user that is assigned
to the lob2 role:
As shown, the lob2 user is able to see the lob1 database in the listing.
However, notice that within the lob1 database only the view we created is
visible. As expected, the user is unable to read the source table either with
SQL access, or from direct HDFS access. Because we saw some "access
denied" messages in this example, let's inspect what shows up in the logfiles,
starting with the HiveServer2 log:
(Driver.java:995)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond
(Driver.java:988)
at org.apache.hive.service.cli.operation.SQLOperation.prepare
(SQLOperation.java:98)
at org.apache.hive.service.cli.operation.SQLOperation.run
(SQLOperation.java:163)
at org.apache.hive.service.cli.session.HiveSessionImpl.
runOperationWithLogCapture(HiveSessionImpl.java:524)
at org.apache.hive.service.cli.session.HiveSessionImpl.
executeStatementInternal(HiveSessionImpl.java:222)
at org.apache.hive.service.cli.session.HiveSessionImpl.
executeStatement(HiveSessionImpl.java:204)
at org.apache.hive.service.cli.CLIService.executeStatement
(CLIService.java:168)
at org.apache.hive.service.cli.thrift.ThriftCLIService.
ExecuteStatement(ThriftCLIService.java:316)
at org.apache.hive.service.cli.thrift.TCLIService$Processor
$ExecuteStatement.getResult(TCLIService.java:1373)
at org.apache.hive.service.cli.thrift.TCLIService$Processor
$ExecuteStatement.getResult(TCLIService.java:1358)
at org.apache.thrift.ProcessFunction.process
(ProcessFunction.java:39)
at
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge20S$Server
$TUGIAssumingProcessor.process(HadoopThriftAuthBridge20S.java:608)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run
(TThreadPoolServer.java:244)
at java.util.concurrent.ThreadPoolExecutor.runWorker
(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.AuthorizationException:
User lob1user does not have privileges for QUERY
at
org.apache.sentry.binding.hive.authz.HiveAuthzBinding.authorize
(HiveAuthzBinding.java:317)
at org.apache.sentry.binding.hive.HiveAuthzBindingHook.
authorizeWithHiveBindings(HiveAuthzBindingHook.java:502)
at org.apache.sentry.binding.hive.HiveAuthzBindingHook.
postAnalyze(HiveAuthzBindingHook.java:312)
... 20 more
Next, we see the access-denied audit event that showed up in the NameNode
audit log:
2015-01-13 20:01:15,005 INFO FSNamesystem.audit: allowed=false
ugi=lob1user@EXAMPLE.COM (auth:KERBEROS)
ip=/10.6.9.73
cmd=listStatus src=/data/etl dst=null perm=null
Case Study: Interactive HBase Web Application
HBase is a natural fit for interactive applications because it provides:
A flexible data model that supports complex objects with rapidly
evolving schemas
Automatic repartitioning of data as nodes are added or removed from the
cluster
Integration with the rest of the Hadoop ecosystem allowing offline
analysis of transactional data
Intra-row ACID transactions
Advanced authorization capabilities for various applications
For our purposes, we’re most interested in the last feature in the list. For
interactive applications, you often have to control which users have access to
which datasets. For example, an application like Twitter has messages that
are fully public, messages that are restricted to a whitelist of authorized
users, and messages that are fully private. Being able to flexibly manage
authorization in the face of such dynamic security requirements requires the
use of a database that is equally dynamic.
In this case study, we’ll take a look at an application for storing and
browsing web page snapshots. This case study is built on top of an open
source, HBase-based web application example from The Kite SDK. The
original example works in a standalone development mode, as an application
deployed on OpenShift, and as a production application deployed on an
HBase cluster. Due to limitations of the MiniHBaseCluster class that is
used for development mode and OpenShift deployments, our version will
only work on production, secured HBase clusters. The full source code for
our version of the example is available in the GitHub source code repository
that accompanies this book.
Let’s start by taking a look at the architecture of the web page snapshot demo
shown in Figure 13-1. The web application gets deployed to an edge node.
The user connects to the application through their browser and provides a
URL to either take a new snapshot or view existing snapshots. When a new
snapshot is taken, the web application downloads the web page and metadata
and stores them in HBase. When a snapshot is viewed, the web application
retrieves the page metadata and the snapshot of the page contents from HBase
and displays it in the browser.
Figure 13-1. Web application architecture
Before we dive into the security requirements, let’s take a look at the data
model used by the example. Each web page is uniquely identified by a URL
and each snapshot is further identified by the time the page was fetched. The
full list of fields in the data model are shown in Table 13-1.
Table 13-1. Web page snapshot data model
Field | Type | Description
url | String | The URL of the web page
fetchedAt | long | The UTC time that this page was fetched
fetchTimeMs | int | The amount of time it took to fetch the web page, in ms
size | int | The size of the web page
title | String | The title of the HTML page, if one exists
description | String | The description from the HTML meta tag
keywords | List<String> | The keywords from the HTML meta tag
outlinks | List<String> | The URLs of pages this page links to
content | String | The content of the web page
HBase stores data as a multidimensional sorted map. This means we need to
map the fields of our records to the row key, column family, and column-
qualifier keys that HBase uses to sort data. For our use case, we want each
row in HBase to be keyed by URL and the time the snapshot was fetched. In
order to make the most recent snapshot sort first, we will reverse the order of
the fetchedAt timestamp before using it in the row key by subtracting it
from Long.MAX_VALUE, and we show that as <rev fetchedAt> in
Figure 13-2. Each field will correspond to a single column in HBase so we
define a mapping from each field name to a column family and column
qualifier. Figure 13-2 shows how the row key is mapped and a sample of
field mappings to HBase columns.
Figure 13-2. Original HBase data model mapping
At this point, we’re ready to add security features to the demo. By default, all
of the fields in the snapshots are accessible to any user. For our use case, we
want to lock down the content of the pages by default and only allow access
if we request a snapshot to be made public. We could use cell-level security
and keep the same data model that we used before, but that is probably
overkill for our use case. Instead, we’ll modify the data model slightly.
In particular, we’ll add a field to our model called contentKey. The
contentKey will be used as the column qualifier for storing content. We’ll
use the username as the contentKey for private snapshots and the special
value public for public snapshots. We’re now want to store the content of
each snapshot under a potentially different column qualifier, so we’ll change
the type of the content field to Map<String, String>. The updated
mapping configuration is shown in Figure 13-3.
Figure 13-3. Updated HBase data model mapping
Before continuing, let’s come up with a list of the security requirements we
want to enforce in our application:
The content of private snapshots is only accessible by the user who
took the snapshot
The content of public snapshots is visible to all users
The metadata of all snapshots is visible to all users
Users authenticate with the application using HTTP basic
authentication
The application impersonates the authenticated user when
communicating with HBase
Authorization is enforced at the HBase level
Cluster Configuration
With these requirements in hand, we can start configuring our cluster.
Requirement five (5) implies that the application needs to authenticate with
HBase. In order for HBase authentication to be enabled, we must first enable
Hadoop authentication. To meet requirement six (6), we also have to enable
HBase authorization. HBase authorization is also required to meet
requirements one (1) and two (2). Requirement three (3) implies that we’ll
allow all users access to the metadata fields. The fourth (4) requirement
applies to the web application itself and the application server used (Tomcat,
for our purposes). We're now ready to plan our configuration steps:
Configure Hadoop authentication (see “Configuration”)
Configure HBase authentication (refer to “Securing Apache HBase” in
The Apache HBase Reference Guide)
Configure HBase authorization by adding the following to hbase-site.xml:
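A sketch of the standard HBase 0.98-era authorization properties:

<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.token.TokenProvider,
org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.regionserver.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>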
Create a Kerberos principal to perform HBase administration
functions:
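For example (the hbase-admin principal name and KDC host are assumptions):

[root@kdc ~]# kadmin
kadmin: addprinc hbase-admin@EXAMPLE.COM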
Create a Kerberos principal for the application and export the key to a
keytab:
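The web-page-snapshots principal name matches the kinit commands used below; the KDC host is an assumption:

[root@kdc ~]# kadmin
kadmin: addprinc -randkey web-page-snapshots@EXAMPLE.COM
kadmin: xst -k app.keytab web-page-snapshots@EXAMPLE.COM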
Copy the keytab file into the home directory of the application user
Grant the application principal create table permissions:
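The opening of this session was lost in extraction; it would begin by authenticating as the admin principal (name assumed, as above) and launching the shell:

[app@snapshots ~]$ kinit hbase-admin
Password for hbase-admin@EXAMPLE.COM:
[app@snapshots ~]$ hbase shell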
14/11/13 14:45:53 INFO Configuration.deprecation: hadoop.native.lib
is
deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.6, rUnknown, Sat Oct 11 15:15:15 PDT 2014
hbase(main):001:0> grant 'web-page-snapshots', 'RWXCA'
0 row(s) in 4.0340 seconds
hbase(main):002:0>
Create the HBase tables:
[app@snapshots ~]$ kinit -kt ~/app.keytab web-page-snapshots
[app@snapshots ~]$ export KITE_USER_CLASSPATH=/etc/hadoop/conf
[app@snapshots ~]$ export \
ZK=zk1.example.com,zk2.example.com,zk3.example.com
[app@snapshots ~]$ kite-dataset create \
dataset:hbase:${ZK}:2181/webpagesnapshots.WebPageSnapshotModel \
-s src/main/avro/hbase-models/WebPageSnapshotModel.avsc
[app@snapshots ~]$ kite-dataset create \
dataset:hbase:${ZK}:2181/webpageredirects.WebPageRedirectModel \
-s src/main/avro/hbase-models/WebPageRedirectModel.avsc
[app@snapshots ~]$
Grant users alice and bob access to the public tables/columns:
[app@snapshots ~]$ kinit -kt ~/app.keytab web-page-snapshots
[app@snapshots ~]$ hbase shell
14/11/13 14:45:53 INFO Configuration.deprecation: hadoop.native.lib
is
deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.6, rUnknown, Sat Oct 11 15:15:15 PDT 2014
hbase(main):001:0> grant 'alice', 'RW', 'webpagesnapshots',
'content',
'public'
0 row(s) in 2.9580 seconds
hbase(main):002:0> grant 'alice', 'RW', 'webpagesnapshots', '_s'
0 row(s) in 0.1640 seconds
hbase(main):003:0> grant 'alice', 'RW', 'webpagesnapshots', 'meta'
0 row(s) in 0.2100 seconds
hbase(main):004:0> grant 'alice', 'RW', 'webpagesnapshots',
'observable'
0 row(s) in 0.1600 seconds
hbase(main):005:0> grant 'alice', 'RW', 'webpageredirects'
0 row(s) in 0.1600 seconds
hbase(main):006:0> grant 'alice', 'RW', 'managed_schemas'
0 row(s) in 0.1570 seconds
hbase(main):007:0> grant 'bob', 'RW', 'webpagesnapshots', 'content',
'public'
0 row(s) in 0.1920 seconds
hbase(main):008:0> grant 'bob', 'RW', 'webpagesnapshots', '_s'
0 row(s) in 0.1510 seconds
hbase(main):009:0> grant 'bob', 'RW', 'webpagesnapshots', 'meta'
0 row(s) in 0.2100 seconds
hbase(main):010:0> grant 'bob', 'RW', 'webpagesnapshots',
'observable'
0 row(s) in 0.1640 seconds
hbase(main):011:0> grant 'bob', 'RW', 'webpageredirects'
0 row(s) in 0.1590 seconds
hbase(main):012:0> grant 'bob', 'RW', 'managed_schemas'
0 row(s) in 0.1870 seconds
hbase(main):013:0>
Grant alice and bob access to their private columns:
[app@snapshots ~]$ kinit -kt ~/app.keytab web-page-snapshots
[app@snapshots ~]$ hbase shell
14/11/13 14:45:53 INFO Configuration.deprecation: hadoop.native.lib
is
deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.6, rUnknown, Sat Oct 11 15:15:15 PDT 2014
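The grants themselves were lost in extraction; following the pattern of the public grants above, each user gets access to the content column qualified by their own username:

hbase(main):001:0> grant 'alice', 'RW', 'webpagesnapshots', 'content', 'alice'
hbase(main):002:0> grant 'bob', 'RW', 'webpagesnapshots', 'content', 'bob'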
Add the following parameters to hbase-site.xml on all of the HBase
nodes to enable user impersonation by the web-page-snapshots
principal:
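A sketch of the standard Hadoop proxy user properties (the host value is an assumption):

<property>
  <name>hadoop.proxyuser.web-page-snapshots.hosts</name>
  <value>snapshots.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.web-page-snapshots.groups</name>
  <value>*</value>
</property>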
TIP
There are additional application configuration steps that are unique to the design and
implementation of the demo application. The full set of steps for running the demo are
available in the project’s README on GitHub.
Implementation Notes
In adding security to our application, we made a number of implementation
changes. The full set of changes can be viewed by comparing our demo with
the original Kite SDK example, but we’ll summarize the key changes here.
The first modification was the addition of a Kerberos login module to obtain
a Kerberos TGT using the application’s keytab. This module is loaded by
Spring before initializing the rest of the web application. Here is an
abbreviated version of the module without logging or error checking:
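The original listing was lost in extraction; a minimal sketch of such a module (the class and field names are illustrative, not from the original source):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginService {
  private final String principal;   // e.g., web-page-snapshots@EXAMPLE.COM
  private final String keytabPath;  // e.g., /home/app/app.keytab

  public KerberosLoginService(String principal, String keytabPath) {
    this.principal = principal;
    this.keytabPath = keytabPath;
  }

  public void login() throws IOException {
    // Pick up hadoop.security.authentication from the cluster configuration
    UserGroupInformation.setConfiguration(new Configuration());
    // Only attempt a Kerberos login if the cluster has security enabled
    if (UserGroupInformation.isSecurityEnabled()) {
      // Obtains the application's Kerberos TGT using the keytab file
      UserGroupInformation.loginUserFromKeytab(principal, keytabPath);
    }
  }
}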
The key takeaways are that we first check that security on our cluster has
been enabled before using the loginUserFromKeytab method on the
UserGroupInformation class. This method will obtain our Kerberos TGT
using the keytab file.
The second change required from a Hadoop security standpoint is modifying
the data access layer to impersonate the authenticated user when
communicating with HBase. To accomplish this, we use the doAs method
of the UserGroupInformation object that represents the proxy user we want
to impersonate. Here is an example of adding impersonation to one of the
data access methods:
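A sketch of wrapping a data access call in doAs (the method, model, and helper names are illustrative; fetchMostRecentSnapshot stands in for the actual HBase read):

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public WebPageSnapshotModel getWebPageSnapshot(final String url, final String user)
    throws Exception {
  // Proxy UGI: the authenticated end user, riding on the
  // application's Kerberos login as the "real" user
  UserGroupInformation ugi = UserGroupInformation.createProxyUser(
      user, UserGroupInformation.getLoginUser());
  return ugi.doAs(new PrivilegedExceptionAction<WebPageSnapshotModel>() {
    @Override
    public WebPageSnapshotModel run() throws Exception {
      // Hypothetical helper that performs the actual HBase read
      return fetchMostRecentSnapshot(url);
    }
  });
}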
The final required modification is to switch from using a single, shared
connection to HBase to creating a connection per user. This is required due
to the way the HBase client caches connections. The most important
takeaways are to create per-user connections and to set the
hbase.client.instance.id property to a unique value in the Configuration
object that HBase will end up using. For this application, we created a utility
method to create and cache our connections:
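A sketch of per-user connection caching, assuming the HBase 0.98-era HConnection API (the class name is illustrative):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;

public final class UserConnections {
  private static final Map<String, HConnection> CACHE = new ConcurrentHashMap<>();

  private UserConnections() {}

  public static synchronized HConnection getConnection(String user) throws IOException {
    HConnection conn = CACHE.get(user);
    if (conn == null) {
      Configuration conf = HBaseConfiguration.create();
      // A unique per-user value so the HBase client does not hand
      // back one shared cached connection for every user
      conf.set("hbase.client.instance.id", user);
      conn = HConnectionManager.createConnection(conf);
      CACHE.put(user, conn);
    }
    return conn;
  }
}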
Summary
In this case study, we reviewed the design and architecture of a typical
interactive HBase application. We then looked at the security considerations
(authentication, authorization, impersonation, etc.) associated with our use
case. We also described changes to the data model necessary to support our
authorization model. Next, we summarized the security requirements that we
wanted to add to the application, followed by the steps necessary to
configure our cluster to meet our security requirements. Finally, we
described elements of the application implementation that required changes
to support the security requirements.
Hadoop has come a long way since its inception. As you have seen
throughout this book, security encompasses a lot of material across the
ecosystem. With the boom of big data and the impact it’s having on
businesses that quickly adopt Hadoop as their data platform of choice, it is
no wonder that Hadoop and its wide ecosystem have moved rapidly. That
being said, Hadoop is still very much in its infancy. Even with the many
security configurations available, Hadoop has much to do until it’s on the
level of relational databases and data warehouses to fully meet the needs of
enterprises that have billions of dollars on the line with their data
management.
The good news is that because of Hadoop’s massive growth in the
marketplace, security deficits in the product are rapidly being filled. We
leave you with some things that are either in development right now
(possibly even completed by the time this is published), as well as features
on the horizon that will be a part of the Hadoop ecosystem in the not too
distant future.
One of the hardest jobs a Hadoop security administrator has is to keep track
of how the myriad of components handles access controls. While we
dedicated a good deal of coverage to Apache Sentry as a centralized
authorization component for Hadoop, it is not there yet in terms of providing
authorization across the entire ecosystem. This will happen in the long term
—and it needs to. Security administrators and auditors alike need to have a
single place they can go to view and manage all policies related to user
authorization controls. Without this, it is simply too easy to make mistakes
along the way.
In the very near term, Apache Sentry will have authorization integration for
HDFS. This will allow for a unified way to define policies for data access
when data is shared between components. For example, if data is loaded into
the Hive warehouse and is controlled by Sentry policies, how is that handled
with MapReduce access? As we saw in Chapter 13, this involved using
HDFS-extended ACLs. With HDFS integration with Sentry, this is not
necessary. Instead, HDFS paths can be specified as controlled by Sentry, thus
authorization decisions are determined by Sentry policies, not standard
POSIX permissions or extended ACLs.
Also on the horizon for Sentry is integration with HBase. We saw in
Chapter 6 that authorization policies are stored in a special table in HBase,
and managed via the HBase shell by default. This is a good candidate to
migrate the policy store to Sentry instead.
This book did not cover the larger topic of data governance, but it did go into
a subtopic of it that relates to accounting. As we saw in Chapter 8, there are
audit logs in many different places that capture activity in the cluster.
However, there is not a centralized place to capture auditing holistically, nor
is there a place to perform general data governance tasks such as managing
business metadata, viewing linkages and lineage, or managing data retention.
These features are prominently covered in the traditional data warehouse.
For Hadoop to reach the next level of security as a whole, data governance
needs to be addressed far better than it is today.
In addition to encryption, Hadoop needs native methods for masking and
tokenization. While masking can be done creatively using UDFs or
specialized views, it makes more sense to provide the ability to mask data on
the fly based on predefined policies. This is available today from other
commercial products, but we believe a native capability should be included
as part of Hadoop. Tokenization is not currently possible at all in Hadoop
without commercial products. Tokenization is important for data scientists
especially because they might not need to see specific values of data, but do
need to preserve linkages and other statistical properties in order to do
analysis. This is not possible with masking, but is possible with tokenization.
Hadoop and big data are exciting markets to be in. While it might be a bit
scary for some, especially seasoned security professionals who are
accustomed to more unified security features, we hope this book has shed
some light on the state of Hadoop security and shown that even a large
Hadoop cluster with many components can be protected using a well-planned
security architecture.
A
AAA (authentication, authorization, and accounting), Authentication,
Authorization, and Accounting-Authentication, Authorization, and Accounting
acceptance filter, The acceptance filter-Hadoop User to Group Mapping
accounting, Authentication, Authorization, and Accounting-
Authentication, Authorization, and Accounting, Accounting-Summary
Accumulo audit logs, Accumulo Audit Logs-Accumulo Audit Logs
active auditing, Accounting
HBase audit logs, HBase Audit Logs-HBase Audit Logs
HDFS audit logs, HDFS Audit Logs-HDFS Audit Logs
Hive audit logs, Hive Audit Logs-Hive Audit Logs
Impala audit logs, Cloudera Impala Audit Logs
log aggregation, Log Aggregation
MapReduce audit logs, MapReduce Audit Logs-MapReduce Audit
Logs
passive auditing, Accounting
security compliance, Accounting
Sentry audit logs, Sentry Audit Logs-Sentry Audit Logs
YARN audit logs, YARN Audit Logs-Hive Audit Logs
Accumulo, Securing Applications-Securing Applications, Accumulo-
audit logs, Accumulo Audit Logs-Accumulo Audit Logs
audited actions, Accumulo Audit Logs
authentication, Username and Password Authentication
authorization, HBase and Accumulo Authorization-Column- and Cell-
cell-level permissions, Column- and Cell-Level Authorization
GarbageCollector, Apache Accumulo
Master, Apache Accumulo
Monitor, Apache Accumulo
namespace-level permissions, System, Namespace, and Table-Level
Authorization
overview, Apache Accumulo-Apache Accumulo
proxy server, Accumulo Proxy Server-Accumulo Proxy Server
root user, System, Namespace, and Table-Level Authorization
shell, Accumulo Shell-Accumulo Shell
system-level permissions, System, Namespace, and Table-Level
Authorization
table-level permissions, System, Namespace, and Table-Level
Authorization
TabletServer, Apache Accumulo
Tracer, Apache Accumulo
visibility labels, Apache Accumulo
Accumulo shell, System, Namespace, and Table-Level Authorization
ACLs (access control lists)
extended, HDFS Extended ACLs-HDFS Extended ACLs
Hadoop, Service-Level Authorization
HDFS-extended, Environment Setup
in MapReduce (MR1), MapReduce (MR1)-MapReduce (MR1)
ZooKeeper, ZooKeeper ACLs-Oozie Authorization
active auditing, Accounting
Advanced Encryption Standard (AES), Encryption Algorithms, HDFS
data transfer protocol encryption
allowed.system.users setting, YARN
Apache Accumulo (see Accumulo)
Apache Flume (see Flume)
Apache HBase (see HBase)
Apache HDFS (see HDFS (Hadoop Distributed File System))
Apache Hive (see Hive)
Apache MapReduce (see MapReduce)
Apache Oozie (see Oozie)
Apache Sentry (see Sentry)
Apache Solr (see Solr)
Apache Sqoop (see Sqoop)
Apache YARN (see YARN)
Apache ZooKeeper (see ZooKeeper)
application-level encryption, HDFS Data-at-Rest Encryption
applications, securing, Securing Applications-Securing Applications
architecture (see system architecture)
AS (Authentication Service), Kerberos Overview, Kerberos Overview
audits/auditing (see accounting)
authentication, Confidentiality, Authentication, Authorization, and
Accounting-Authentication, Authorization, and Accounting,
(see also strong authentication)
configuration settings, Configuration-HBase
(see also configuration settings for authentication)
Hue, Hue Authentication
impersonation, Impersonation-Configuration
Kerberos, Authentication
(see also Kerberos)
keytab files, MIT Kerberos
simple versus Kerberos, Kerberos
tokens, Tokens-Job tokens
username and password, Username and Password Authentication
authorization, Authentication, Authorization, and Accounting-
Authentication, Authorization, and Accounting, Hadoop Security: A History, Authorization, Summary
HBase and Accumulo, HBase and Accumulo Authorization-Column-
and Cell-Level Authorization
HDFS, HDFS Authorization-HDFS Extended ACLs
Hue, Hue Authorization
MapReduce (MR1), MapReduce and YARN Authorization-
MapReduce (MR1)
MapReduce (YARN/MR2), YARN (MR2)-ZooKeeper ACLs
Oozie, Oozie Authorization-Oozie Authorization
service-level, Service-Level Authorization-Service-Level
Authorization
ZooKeeper, ZooKeeper ACLs-Oozie Authorization
B
Beeline, Hive, Using HiveServer2 with Kerberos authentication, Using
HiveServer2 with LDAP/Active Directory authentication, Environment
Setup
bidirectional trust, Kerberos Trusts
BigTable (Google), Apache HBase
block access tokens, Block access tokens
blocks, Apache HDFS
C
capacity-scheduler.xml, CapacityScheduler
CapacityScheduler, CapacityScheduler-ZooKeeper ACLs
case studies
HBase/interactive web application, Case Study: Interactive HBase Web Application-Summary
Sentry/multitenancy, Case Study: Hadoop Data Warehouse-Summary
Catalog server (Impala), Cloudera Impala
certificate signing request (CSR), Transport Layer Security, Flume
CIA, Security Overview-Availability
availability, Availability
confidentiality, Confidentiality
integrity, Integrity
CIA model, Threats to Data
client access security
Accumulo and, Accumulo-Accumulo Proxy Server
HBase and, HBase-Accumulo
(see also HBase)
Oozie and, Oozie-Oozie
client access, command-line tools for, Data Extraction and Client Access-Hadoop Command-Line Interface
cloud environments, Operating Environment
Cloudera Hue (see Hue)
Cloudera Impala (see Impala)
clusters, Operating Environment
network traffic in, Host Firewalls
command-line interface, Hadoop Command-Line Interface-Hadoop
Command-Line Interface
command-line tools for client access, Data Extraction and Client Access
Security
confidentiality, Confidentiality
config-tool command (Sentry), Policy File Verification and Validation-
Migrating From Policy Files
configuration parameters
sentry-site.xml, Sentry Service Configuration-Hive Authorization
configuration settings for authentication, Configuration-Summary
example, Configuration-Configuration
HBase, HBase
HDFS, HDFS-YARN
MapReduce (MR1), MapReduce (MR1)-Oozie
Oozie, Oozie-HBase
YARN, YARN-MapReduce (MR1)
container-executor.cfg, YARN
core-site.xml, Mapping Kerberos Principals to Usernames, Hadoop User Group Mapping-Mapping users to groups using LDAP, Configuration,
Service-Level Authorization, Hive Authorization, Configuration,
Hadoop RPC Encryption, Encrypted shuffle and encrypted web UI,
Cloudera Hue
credentials cache, MIT Kerberos
D
data destruction and deletion, Data Destruction and Deletion-Data
Destruction and Deletion
data encryption key (DEK), HDFS Data-at-Rest Encryption
data extraction security, Data Extraction and Client Access Security-
Securing Applications
Hive and, Hive-WebHDFS/HttpFS
Impala and, Impala-Hive
with SQL (see Hive, Impala)
Sqoop and, Sqoop-SQL Access
WebHDFS and HttpFS, WebHDFS/HttpFS-Summary
data gateway nodes, Edge Nodes
data ingest, Securing Data Ingest-Summary
confidentiality of, Data Ingest Confidentiality-Sqoop Encryption
enterprise architecture and, Enterprise Architecture-Enterprise
Architecture
with Flume, Securing Data Ingest-Flume Encryption
from command line, Securing Data Ingest
integrity of data, Integrity of Ingested Data-Integrity of Ingested
Data
Sqoop, Securing Data Ingest, Integrity of Ingested Data
with Sqoop, Sqoop Encryption-Sqoop Encryption
workflows, Ingest Workflows
data integrity (see integrity)
data protection (see data destruction and deletion, encryption)
data transfer encryption, HDFS data transfer protocol encryption
data, threats to, Threats to Data
data-at-rest encryption, Confidentiality, Encrypting Data at Rest-
Important Data Security Consideration for Hadoop
encrypted drives, Encrypting Data at Rest
filesystem, Filesystem Encryption-Filesystem Encryption
full-disk, Full Disk Encryption-Full Disk Encryption
HDFS, Encrypting Data at Rest, HDFS Data-at-Rest Encryption-
Client operations
Impala disk spill, Impala Disk Spill Encryption
intermediate (MapReduce2), MapReduce2 Intermediate Data
Encryption
key management, Encryption and Key Management-Encryption and Key Management
data-at-rest, defined, Data Protection
data-in-transit encryption, Encrypting Data in Transit-Encrypted shuffle and encrypted web UI
encrypted shuffle and encrypted web UI, Encrypted shuffle and
encrypted web UI-Encrypted shuffle and encrypted web UI
HDFS data transfer protocol, HDFS data transfer protocol encryption
HTTP, Hadoop HTTP encryption
transport layer security, Transport Layer Security-SSL/TLS
handshake
data-in-transit, defined, Data Protection
DataNode, Apache HDFS, HDFS
default ACL, HDFS Extended ACLs
default realm, MIT Kerberos
defense in depth, Defense in Depth
delegation tokens, Delegation tokens, Hadoop Command-Line Interface-Hadoop Command-Line Interface
denial of service (DoS), Denial of Service
DIGEST-MD5, Kerberos
disk spill encryption, Impala Disk Spill Encryption
distributed denial of service (DDoS), Denial of Service, Intrusion
Detection and Prevention
distributed systems, Securing Distributed Systems-Summary
bank as example of, Securing Distributed Systems
defense in depth, Defense in Depth
threat and risk assessment, Threat and Risk Assessment
environment assessment, Environment Assessment-Environment
Assessment
user assessment, User Assessment
threat categories, Threat Categories-Threats to Data
denial of service (DoS) attacks, Denial of Service
insider threats, Insider Threat
threats to data, Threats to Data
unauthorized access/masquerade, Unauthorized
Access/Masquerade-Unauthorized Access/Masquerade
vulnerabilities, Vulnerabilities-Vulnerabilities
E
edge nodes, Edge Nodes-Edge Nodes, Ingest Workflows-Ingest
Workflows
encrypted DEK (EDEK), HDFS Data-at-Rest Encryption, KMS
authorization
encrypted drives, Encrypting Data at Rest
encryption, Confidentiality, Data Protection-Encrypted shuffle and encrypted web UI
algorithms, Encryption Algorithms-Encryption Algorithms
application-level, HDFS Data-at-Rest Encryption
file channel (Flume), Flume Encryption-Flume Encryption
Flume, Flume Encryption-Flume Encryption
key size, Encryption Algorithms
of data-at-rest, Encrypting Data at Rest-Important Data Security Consideration for Hadoop
encrypted drives, Encrypting Data at Rest
filesystem, Encrypting Data at Rest, Filesystem Encryption-Filesystem Encryption
full disk, Encrypting Data at Rest, Full Disk Encryption-Full Disk
Encryption
HDFS, Encrypting Data at Rest, HDFS Data-at-Rest Encryption-
Client operations
Impala disk spill, Impala Disk Spill Encryption
intermediate (MapReduce2), MapReduce2 Intermediate Data
Encryption
key management, Encryption and Key Management-Encryption and Key Management
of data-in-transit, Encrypting Data in Transit-Encrypted shuffle and encrypted web UI
encrypted shuffle and encrypted web UI, Encrypted shuffle and
encrypted web UI-Encrypted shuffle and encrypted web UI
HDFS data transfer protocol, HDFS data transfer protocol
encryption
HTTP, Hadoop HTTP encryption
RPC, Hadoop RPC Encryption
transport layer security, Transport Layer Security-SSL/TLS
handshake
Sqoop, Sqoop Encryption-Sqoop Encryption
encryption zone key, HDFS Data-at-Rest Encryption
encryption zones, HDFS Data-at-Rest Encryption
enterprise architecture, Enterprise Architecture-Enterprise Architecture
environment assessment, Environment Assessment-Environment
Assessment
/etc/default/solr, Solr Sentry Configuration, Cloudera Hue
extended ACLs, HDFS Extended ACLs-HDFS Extended ACLs
F
fail close, Intrusion Detection and Prevention
fair-scheduler.xml, FairScheduler-FairScheduler
FairScheduler, FairScheduler-CapacityScheduler
filesystem encryption, Encrypting Data at Rest, Filesystem Encryption-
Filesystem Encryption
filtering categories
administration traffic, Administration traffic
client access, Client access
data movement, Data movement
filtering decisions, Network Firewalls
firewalls, Network Firewalls-Administration traffic
host, Host Firewalls-Host Firewalls
Flume, Securing Data Ingest-Integrity of Ingested Data
agents, Apache Flume
SSL encryption with, Flume Encryption-Flume Encryption
forwardable, Client Configuration
freeIPA, Using HiveServer2 with LDAP/Active Directory authentication
full-disk encryption, Encrypting Data at Rest, Full Disk Encryption-Full
Disk Encryption
G
GNU shred, Data Destruction and Deletion
Google File System (GFS), Introduction
grant option, SQL Commands
Gutmann method, Data Destruction and Deletion
H
Hadoop ecosystem, Introduction
authentication methods, Authentication
components overview, Hadoop Components and Ecosystem-Cloudera
Hue, Hadoop Roles and Separation Strategies
Hadoop, evolution of, Introduction
hadoop-policy.xml, Service-Level Authorization, Service-Level
Authorization, Service-Level Authorization
hardware security module (HSM), HDFS Data-at-Rest Encryption
HBase
authorization, HBase and Accumulo Authorization-Column- and Cell-Level Authorization
client access security with, HBase-Accumulo
column-level permissions, Column- and Cell-Level Authorization
configuration settings, HBase
interactive application case study, Case Study: Interactive HBase Application-Summary
overview, Apache HBase-Apache HBase
permissions, System, Namespace, and Table-Level Authorization-
Column- and Cell-Level Authorization
REST gateway, Apache HBase, HBase REST Gateway-HBase Thrift
Gateway
shell, HBase Shell-HBase Shell
Thrift gateway, Apache HBase, HBase Thrift Gateway-Accumulo
hbase-site.xml, HBase REST Gateway, HBase REST Gateway, HBase
REST Gateway, HBase REST Gateway, HBase REST Gateway, HBase
Thrift Gateway, Cluster Configuration, Cluster Configuration
HDFS (Hadoop Distributed File System)
audit logs, HDFS Audit Logs-HDFS Audit Logs
authentication, Kerberos
authorization, HDFS Authorization-HDFS Extended ACLs
configuration, HDFS-YARN
data transfer protocol, HDFS data transfer protocol encryption
DataNode, Apache HDFS
encryption, HDFS Data-at-Rest Encryption-Client operations
client operations, Client operations-Client operations
configuration, Configuration-Configuration
KMS authorization, KMS authorization-KMS authorization
encryption in, Encrypting Data at Rest
extended ACLs, HDFS Extended ACLs-HDFS Extended ACLs,
Environment Setup
HttpFS, Apache HDFS
JournalNode, Apache HDFS
KMS, Apache HDFS
NameNode, Apache HDFS
NFS gateway, Apache HDFS
overview, Apache HDFS-Apache HDFS
service-level authorization properties, Service-Level Authorization-
Service-Level Authorization
hdfs-site.xml, HDFS-HDFS, HDFS Extended ACLs, Service-Level
Authorization, Hive Sentry Configuration, Hive Sentry Configuration,
HDFS data transfer protocol encryption
Hive, Hive-WebHDFS/HttpFS
architecture, Hive Authorization-Hive Sentry Configuration
audit logs, Hive Audit Logs-Hive Audit Logs
Beeline, Hive, Using HiveServer2 with Kerberos authentication, Using
HiveServer2 with LDAP/Active Directory authentication,
Environment Setup
Hive Metastore server, Hive Sentry Configuration, Hive Audit Logs
Hive warehouse lockdown, Hive Sentry Configuration
HiveServer2, Apache Hive, Hive Sentry Configuration-Hive Sentry
Configuration
configuration properties, Hive-Using HiveServer2 with Kerberos
authentication
with Kerberos authentication, Using HiveServer2 with Kerberos
authentication-Using HiveServer2 with Kerberos authentication
with LDAP/Active Directory authentication, Using HiveServer2 with LDAP/Active Directory authentication-Using HiveServer2
with pluggable authentication
over-the-wire encryption, HiveServer2 over-the-wire encryption-
WebHDFS/HttpFS
with pluggable authentication, Using HiveServer2 with pluggable
authentication
versus Impala, Cloudera Impala
impersonation, Hive Sentry Configuration
and impersonation, Impersonation
metastore database, Apache Hive
Metastore server, Apache Hive
overview, Apache Hive
Sentry for authorization, The Sentry Service, Hive Authorization-
Hive Sentry Configuration
hive-env.sh, Using HiveServer2 with LDAP/Active Directory
authentication
hive-site.xml, Hive Sentry Configuration-Hive Sentry Configuration,
Hive Sentry Configuration, Hive-Using HiveServer2 with Kerberos
authentication, Using HiveServer2 with LDAP/Active Directory
authentication, Using HiveServer2 with pluggable authentication,
HiveServer2 over-the-wire encryption
host firewalls, Host Firewalls-Host Firewalls
HTTP encryption, Hadoop HTTP encryption
HttpFS, Apache HDFS, WebHDFS/HttpFS-Summary
HTTPS, Hadoop HTTP encryption, Hue HTTPS
Hue, Cloudera Hue-Summary
authentication, Hue Authentication-LDAP Backend
authorization, Hue Authorization-Hue Authorization
configuring Kerberos in, Cloudera Hue-Cloudera Hue
configuring user impersonation for Oozie, Cloudera Hue
configuring user impersonation for Solr, Cloudera Hue
HTTPS, Hue HTTPS
and impersonation, Impersonation
Kerberos Ticket Renewer, Cloudera Hue
overview, Cloudera Hue
private key, Hue HTTPS
server, Cloudera Hue
SSL client configurations, Hue SSL Client Configurations-Hue SSL Client Configurations
superusers, Hue Authorization-Hue Authorization
hue.ini, Cloudera Hue, Hue HTTPS-Hue Authentication, LDAP Backend
I
identity, Confidentiality-Authentication, Authorization, and Accounting,
Identity-Provisioning of Hadoop Users
Hadoop user-to-group mapping, Hadoop User to Group Mapping-
Mapping users to groups using LDAP
mapping Kerberos principals to usernames, Mapping Kerberos
Principals to Usernames-Hadoop User to Group Mapping
provisioning of Hadoop users, Provisioning of Hadoop Users
Impala, Apache YARN
architecture, Impala Authorization
audit logs, Cloudera Impala Audit Logs
Catalog server, Cloudera Impala
disk spill encryption, Impala Disk Spill Encryption
versus Hive, Cloudera Impala
with Kerberos authentication, Using Impala with Kerberos
authentication-Using Impala with LDAP/Active Directory
authentication
with LDAP/Active Directory authentication, Using Impala with
LDAP/Active Directory authentication-Using SSL wire encryption with Impala
Sentry for authorization, The Sentry Service, Impala Authorization-
Solr Authorization
SSL wire encryption with, Using SSL wire encryption with Impala-
Hive
impersonation, Impersonation-Configuration, HBase Thrift Gateway,
Cloudera Hue
in-flight encryption, Confidentiality
in-house environments, Operating Environment
ingest pipelines, Data movement, Intrusion Detection and Prevention,
(see also data ingest)
ingested data (see data ingest)
initial principal translations, The initial principal translation
insider threats, Insider Threat
integrity, Integrity
intrusion detection systems (IDS), Intrusion Detection and Prevention-
Intrusion Detection and Prevention
intrusion prevention systems (IPS), Intrusion Detection and Prevention-
Intrusion Detection and Prevention
iptables, Host Firewalls-Host Firewalls
J
Java truststore, Flume Encryption, Sqoop Encryption
job tokens, Job tokens-Job tokens, Service-Level Authorization
JobHistoryServer (YARN), Apache YARN, YARN
JobTracker (MapReduce), Apache MapReduce
job tokens, Job tokens-Job tokens
mapping in, Hadoop User to Group Mapping
JournalNode, Apache HDFS, HDFS
K
KDC (key distribution center), Kerberos Overview-Kerberos Overview
kdestroy, MIT Kerberos
Kerberos, Hadoop Security: A Brief History, Apache MapReduce,
example workflow, Kerberos Workflow: A Simple Example-Kerberos
Workflow: A Simple Example
HiveServer2 with, Using HiveServer2 with Kerberos authentication-
Using HiveServer2 with Kerberos authentication
how it works, Kerberos Overview-Kerberos Overview
Hue and, Cloudera Hue-Cloudera Hue
Impala with, Using Impala with Kerberos authentication-Using Impala with LDAP/Active Directory authentication
mapping principals to usernames, Mapping Kerberos Principals to
Usernames-The substitution command
MIT distribution, MIT Kerberos-Client Configuration
client configuration, Client Configuration-Client Configuration
encryption types, Server Configuration
kdestroy, MIT Kerberos
keytab files, MIT Kerberos
klist, MIT Kerberos
server configuration, Server Configuration
naming convention, Kerberos Overview
purpose of, Why Kerberos?-Why Kerberos?
terminology, Kerberos Overview, Kerberos Overview
ticket-granting tickets, Hadoop Command-Line Interface, HBase
Shell
trusts, Kerberos Trusts-Kerberos Trusts
key management systems, Encryption and Key Management-Encryption
and Key Management
keystore file, Oozie
keytab files, MIT Kerberos
kinit, MIT Kerberos, Hadoop Command-Line Interface, HBase Shell
KMS (key management server), Apache HDFS, HDFS Data-at-Rest
Encryption, HDFS Data-at-Rest Encryption, KMS authorization-KMS
authorization
krb5 (see Kerberos, MIT distribution)
L
LDAP-based authentication, Using HiveServer2 with LDAP/Active Directory authentication-Using HiveServer2 with pluggable
authentication
LDAP/Active Directory Hue authentication backend, LDAP Backend-
LDAP Backend
LdapGroupsMapping, Mapping users to groups using LDAP-Mapping
users to groups using LDAP
Linux, iptables, Host Firewalls-Host Firewalls
LinuxContainerExecutor, YARN
log aggregation, Log Aggregation
log events (see accounting)
LUKS (Linux Unified Key Setup), Full Disk Encryption-Full Disk
Encryption
M
managed environments, Operating Environment
management nodes, Management Nodes-Management Nodes
mapping
Hadoop user-to-group, Hadoop User to Group Mapping-Mapping
users to groups using LDAP
Kerberos principals to usernames, Mapping Kerberos Principals to
Usernames-Hadoop User to Group Mapping
using LDAP, Mapping users to groups using LDAP-Mapping users to groups using LDAP
mapred-site.xml, YARN-MapReduce (MR1), Service-Level
Authorization-MapReduce (MR1), Encrypted shuffle and encrypted web
UI-Encrypted shuffle and encrypted web UI
MapReduce, Introduction, Apache YARN
ACLs, MapReduce and YARN Authorization
administrator, MapReduce and YARN Authorization
audit logs, MapReduce Audit Logs
authentication, Kerberos
authorization, MapReduce and YARN Authorization-ZooKeeper
ACLs
cluster owner, MapReduce and YARN Authorization
configuration settings, MapReduce (MR1)-Oozie
encrypted shuffle and encrypted web UI, Encrypted shuffle and
encrypted web UI
intermediate data encryption (MR2), MapReduce2 Intermediate Data
Encryption
Job History server, Service-Level Authorization
job owner, MapReduce and YARN Authorization
job submissions in, Apache MapReduce
JobTracker, Apache MapReduce, Job tokens
overview, Apache MapReduce-Apache MapReduce
queue administrator, MapReduce and YARN Authorization
service-level authorization properties, Service-Level Authorization,
Service-Level Authorization-Service-Level Authorization
TaskTracker, Apache MapReduce, Job tokens
masquerade attacks, Unauthorized Access/Masquerade-Unauthorized
Access/Masquerade
master nodes, Master Nodes-Master Nodes
metastore (Hive), Apache Hive
Microsoft Active Directory, Hadoop Security: A Brief History
min.user.id setting, YARN
MIT Kerberos, MIT Kerberos-Client Configuration
(see also Kerberos)
N
NameNode, Apache HDFS
authentication, HDFS
and block access tokens, Block access tokens
and delegation tokens, Delegation tokens
mapping in, Hadoop User to Group Mapping
native encryption at rest, HDFS Data-at-Rest Encryption
network firewalls, Network Firewalls-Administration traffic
network security, Network Security-Intrusion Detection and Prevention
firewalls, Network Firewalls-Administration traffic
intrusion detection and prevention, Intrusion Detection and
Prevention-Intrusion Detection and Prevention
segmentation, Network Segmentation-Network Segmentation
network segmentation, Network Segmentation-Network Segmentation
NFS gateway, Apache HDFS
NodeManager (YARN), Apache YARN, YARN
nodes classification, Master Nodes-Edge Nodes
Nutch, Introduction
O
one-way trusts, Kerberos Trusts
Oozie, Ingest Workflows, Oozie-Oozie
authentication, Kerberos
configuration settings, Oozie
Hue and, Cloudera Hue
impersonation, Impersonation
overview, Apache Oozie
oozie-site.xml, Oozie, Oozie Authorization, Cloudera Hue
OpenLDAP, Using HiveServer2 with LDAP/Active Directory
authentication
operating environments, Operating Environment
operating system security, Operating System Security-SELinux
host firewalls, Host Firewalls-Host Firewalls
remote access controls, Remote Access Controls
over-the-wire encryption, Data Protection, HiveServer2 over-the-wire
encryption-WebHDFS/HttpFS
P
passive auditing, Accounting
patches, Vulnerabilities
perimeter security, Defense in Depth
permissions, Authorization-HDFS Extended ACLs
(see also authorization)
POSIX, Authorization-HDFS Extended ACLs
ZooKeeper, ZooKeeper ACLs
ping of death, Vulnerabilities
PKCS #12, Transport Layer Security, Flume Encryption
pluggable authentication, Using HiveServer2 with pluggable
authentication
policy import tool (Sentry), Migrating From Policy Files
POSIX permissions, Authorization-HDFS Extended ACLs
principals, Kerberos Overview
initial principal translations, The initial principal translation
mapping to usernames, Mapping Kerberos Principals to Usernames-
The substitution command
private key, Transport Layer Security, Flume Encryption, Hue HTTPS
provisioning, Provisioning of Hadoop Users
proxying, Impersonation
public key, Transport Layer Security
R
RBAC (role-based access controls), Apache Sentry (Incubating)
realms, Kerberos Overview, MIT Kerberos, The substitution command
renew lifetime, Client Configuration
ResourceManager (YARN), Apache YARN, Hadoop User to Group Mapping, YARN
REST server, HBase REST Gateway-HBase Thrift Gateway
risk assessment (see threat and risk assessment)
root user, Accumulo, System, Namespace, and Table-Level Authorization
RPC encryption, Hadoop RPC Encryption-HDFS data transfer protocol
encryption
RSA key exchange algorithm, SSL/TLS handshake
rules format, The substitution command
S
SAML Hue authentication backend, SAML Backend-SAML Backend
SASL (Simple Authentication and Security Layer) framework, Kerberos
schema-on-read, Apache Hive
search bind, LDAP Backend
Secure Socket Layer (SSL), Transport Layer Security-SSL/TLS
handshake
securing applications, Securing Applications-Securing Applications
Security Assertion Markup Language (see SAML Hue authentication
backend)
security compliance, Accounting
Security Enhanced Linux (see SELinux)
security history, Hadoop Security: A Brief History-Hadoop Security: A Brief History
security overview, Security Overview-Authentication, Authorization,
and Accounting
segmentation, Network Segmentation-Network Segmentation
SELinux, SELinux-SELinux
Sentry, Hadoop Security: A Brief History
audit logs, Sentry Audit Logs
concepts, Sentry Concepts-Sentry Concepts
entity relationships, Sentry Concepts
groups, Sentry Concepts-Sentry Concepts
for Hive, Hive Authorization-Hive Sentry Configuration
for Impala, Impala Authorization-Solr Authorization
models, Sentry Concepts
multitenancy case study, Case Study: Hadoop Data Warehouse-Summary
overview, Apache Sentry (Incubating)
policy administration, Sentry Policy Administration-Summary
Solr policy file, Solr Policy File
SQL commands for, SQL Commands-SQL Commands
SQL policy file, SQL Policy File-SQL Policy File
verification and validation, Policy File Verification and Validation-
Migrating From Policy Files
policy engine, Sentry Concepts
policy provider, Sentry Concepts
privileges, Sentry Concepts-Sentry Concepts
Sentry server, Apache Sentry (Incubating)
Sentry service, The Sentry Service
architecture, The Sentry Service
configuration and examples, Sentry Service Configuration-Hive Authorization
policy administration, Sentry Policy Administration-SQL Commands
Solr privilege model, Solr Privilege Model-Sentry Policy
Administration
SQL privilege model, Sentry Privilege Models-SQL Privilege Model
users, Sentry Concepts-Sentry Concepts
sentry-provider.ini, Hive Sentry Configuration, SQL Policy File-SQL
Policy File
sentry-site.xml, Sentry Service Configuration-Hive Authorization, Hive
Sentry Configuration-Hive Sentry Configuration, Impala Sentry
Configuration-Solr Sentry Configuration, SQL Policy File
service ports, common, Host Firewalls-Host Firewalls
service-level authorization, Service-Level Authorization
default policies example, Service-Level Authorization-Service-Level
Authorization
deleting user files example, Service-Level Authorization-Service-
Level Authorization
MapReduce Job History server, Service-Level Authorization
recommended policies example, Service-Level Authorization-Service-Level Authorization
setgid permissions, HDFS Authorization
setuid permissions, HDFS Authorization
shred, Data Destruction and Deletion
signed certificate, Transport Layer Security
Simple and Protected GSSAPI Negotiation Mechanism (see SPNEGO)
simple authentication, Kerberos
software vulnerability, Vulnerabilities
Solr
document-level authorization, Solr Sentry Configuration
overview, Apache Solr
Sentry for authorization, Solr Authorization-Solr Sentry
Configuration
Sentry policy administration with, Solr Policy File
Sentry privilege model, Solr Privilege Model-Sentry Policy
Administration
solrconfig.xml, Solr Sentry Configuration
Spark, Apache YARN
SPN (service principal name), Kerberos Overview, Configuration
SPNEGO, Kerberos, SPNEGO Backend-SPNEGO Backend
SQL
Sentry policy-based administration, SQL Policy File-SQL Policy File
Sentry privilege model, SQL Privilege Model-SQL Privilege Model
Sentry server policy administration, SQL Commands-SQL Policy File
SQL access, SQL Access-WebHDFS/HttpFS
(see also Hive, Impala)
Sqoop, Apache Sqoop, Securing Data Ingest, Integrity of Ingested Data,
Sqoop Encryption-Sqoop Encryption, Sqoop-SQL Access
SSH, Authentication, Authorization, and Accounting
SSHD, Authentication, Authorization, and Accounting
ssl-client.xml, Encrypted shuffle and encrypted web UI, Encrypted
shuffle and encrypted web UI
ssl-server.xml, Encrypted shuffle and encrypted web UI-Encrypted
shuffle and encrypted web UI
standard permissions, HDFS Authorization-HDFS Authorization
StateStore (Impala), Cloudera Impala
sticky permissions, HDFS Authorization
strong authentication, Kerberos-Summary
(see also Kerberos)
substitution command, The substitution command
sudo command, Authentication, Authorization, and Accounting
system architecture, System Architecture-Summary
Hadoop roles and separation strategies, Hadoop Roles and Separation Strategies-Edge Nodes
network security, Network Security-Intrusion Detection and
Prevention
nodes classification, Master Nodes-Edge Nodes
operating environment, Operating Environment-Operating
Environment
operating system security, Operating System Security-SELinux
T
tasks (MapReduce), Apache MapReduce
TaskTracker (MapReduce), Apache MapReduce, Job tokens
TGS (Ticket Granting Service), Kerberos Overview-Kerberos
Overview
TGT (ticket-granting ticket), Kerberos Overview
threat and risk assessment, Threat and Risk Assessment-Environment
Assessment
environment assessment, Environment Assessment-Environment
Assessment
user assessment, User Assessment
threat categories, in distributed systems, Threat Categories-Threats to
Data
(see also distributed systems)
ticket lifetime, Client Configuration
token renewer, Delegation tokens
tokens, Tokens-Job tokens, Service-Level Authorization
Transport Layer Security (TLS), Transport Layer Security-SSL/TLS
handshake
trusts, Kerberos Trusts-Kerberos Trusts
truststore, Flume Encryption, Sqoop Encryption
two-way trusts, Kerberos Trusts
U
unauthorized access attacks, Unauthorized Access/Masquerade-
Unauthorized Access/Masquerade
UPNs (user principal names), Kerberos Overview
user assessment, User Assessment
user-to-group mapping, Hadoop User to Group Mapping-Mapping users to groups using LDAP
username and password authentication, Username and Password
Authentication
usernames, Kerberos, Mapping Kerberos Principals to Usernames-The
substitution command
V
visibility labels (Accumulo), Apache Accumulo
VLANs (virtual local area networks), Network Segmentation
vulnerabilities, Unauthorized Access/Masquerade
W
WebHDFS, HDFS, WebHDFS/HttpFS-Summary
WITH GRANT OPTION, SQL Commands
worker nodes, Worker Nodes-Worker Nodes
Y
YARN
audit logs, YARN Audit Logs-Hive Audit Logs
authentication, Kerberos
authorization (MR2), YARN (MR2)-ZooKeeper ACLs
CapacityScheduler, CapacityScheduler-ZooKeeper ACLs
cluster owner, MapReduce and YARN Authorization
configuration, YARN-MapReduce (MR1)
FairScheduler, FairScheduler-CapacityScheduler
overview, Apache YARN
service-level authorization properties, Service-Level Authorization-Service-Level Authorization
yarn-site.xml, YARN-YARN, FairScheduler, CapacityScheduler
Z
ZooKeeper
ACLs, ZooKeeper ACLs-Oozie Authorization
authentication, Kerberos-Username and Password Authentication
overview, Apache ZooKeeper
About the Authors
Ben Spivey is currently a solutions architect at Cloudera. During his time
with Cloudera, he has worked in a consulting capacity to assist customers
with their Hadoop deployments. Ben has worked with many Fortune 500
companies across multiple industries, including financial services, retail, and
health care. His primary expertise is the planning, installation, configuration,
and securing of customers’ Hadoop clusters.
Prior to Cloudera, Ben worked for the National Security Agency and with a
defense contractor as a software engineer. During this time, Ben built
applications that, among other things, integrated with enterprise security
infrastructure to protect sensitive information.
Joey Echeverria is a software engineer at Rocana where he builds the next
generation of IT Operations Analytics on the Apache Hadoop platform. Joey
is also a committer on the Kite SDK, an Apache-licensed data API for the
Hadoop ecosystem. Joey was previously a software engineer at Cloudera
where he contributed to a number of ASF projects including Apache Flume,
Apache Sqoop, Apache Hadoop, and Apache HBase.
Colophon
The animal on the cover of Hadoop Security is a Japanese badger (Meles anakuma), a member of the same family as weasels. As its name suggests, it’s endemic
to Japan; it is found on Honshu, Kyushu, Shikoku, and Shodoshima.
Japanese badgers are small compared to their European counterparts. Males
are about 31 inches in length and females are a little smaller at an average of
28 inches. Other than the size of their canine teeth, males and females don’t
differ much physically. Adults weigh about 8.8 to 17.6 pounds, and have
blunt torsos with short limbs. The badger has powerful digging claws on its
front feet and smaller hind feet. Though not as distinct as on the European
badger, the Japanese badger has the characteristic black and white stripes on
its face.
Japanese badgers are nocturnal and hibernate during the winter. Once
females are two years old, they mate and give birth to litters of up to two or three cubs
in the spring. Compared to their European counterparts, Japanese badgers are
more solitary; mates don’t form pair bonds.
Japanese badgers inhabit a variety of woodland and forest habitats, where
they eat an omnivorous diet of worms, beetles, berries, and persimmons.
Many of the animals on O’Reilly covers are endangered; all of them are
important to the world. To learn more about how you can help, go to
animals.oreilly.com.
The cover image is from loose plates; the source is unknown. The cover fonts
are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro;
the heading font is Adobe Myriad Condensed; and the code font is Dalton
Maag’s Ubuntu Mono.
For your convenience Apress has placed some of the front
matter material after the index. Please use the Bookmarks
and Contents at a Glance links to access them.
Contents at a Glance
About the Author
About the Technical Reviewer
Part I: Introducing Hadoop and Its Security
Chapter 1: Understanding Security Concepts
Chapter 2: Introducing Hadoop
Chapter 3: Introducing Hadoop Security
Part II: Authenticating and Authorizing Within Your Hadoop Cluster
Chapter 4: Open Source Authentication in Hadoop
Chapter 5: Implementing Granular Authorization
Part III: Audit Logging and Security Monitoring
Chapter 6: Hadoop Logs: Relating and Interpretation
Chapter 7: Monitoring in Hadoop
Part IV: Encryption for Hadoop
Chapter 8: Encryption in Hadoop
Part V: Appendices
Appendix A: Pageant Use and Implementation
Appendix B: PuTTY and SSH Implementation for Linux-Based Clients
Appendix C: Setting Up a KeyStore and TrustStore for HTTP Encryption
Appendix D: Hadoop Metrics and Their Relevance to Security
Index
Introduction
Last year, I was designing security for a client who was looking for a reference book that talked about security
implementations in the Hadoop arena, simply so he could avoid known issues and pitfalls. To my chagrin, I couldn’t
locate a single book for him that covered the security aspect of Hadoop in detail or provided options for people who
were planning to secure their clusters holding sensitive data! I was disappointed and surprised. Everyone planning to
secure their Hadoop cluster must have been going through similar frustration. So I decided to put my security design
experience to broader use and write the book myself.
As Hadoop gains more corporate support and usage by the day, we all need to recognize and focus on the
security aspects of Hadoop. Corporate implementations also involve following regulations and laws for data
protection and confidentiality, and such security issues are a driving force for making Hadoop “corporation ready.”
Open-source software usually lacks organized documentation and consensus on performing a particular
functional task uniquely, and Hadoop is no different in that regard. The various distributions that mushroomed in the
last few years vary in their implementation of various Hadoop functions, and some, such as authorization or encryption,
are not even provided by all the vendor distributions. So, in this way, Hadoop is like the Unix of the ’80s or ’90s: open
source development has led to a large number of variations and in some cases deviations from functionality. Because
of these variations, devising a common strategy to secure your Hadoop installation is difficult. In this book, I have
tried to provide a strategy and solution (an open source solution when possible) that will apply in most of the cases,
but exceptions may exist, especially if you use a Hadoop distribution that’s not well-known.
It’s been a great and exciting journey developing this book, and I deliberately say “developing,” because I believe
that authoring a technical book is very similar to working on a software project. There are challenges, rewards, exciting
developments, and of course, unforeseen obstacles—not to mention deadlines!
Who This Book Is For
This book is an excellent resource for IT managers planning a production Hadoop environment or Hadoop
administrators who want to secure their environment. This book is also for Hadoop developers who wish to
implement security in their environments, as well as students who wish to learn about Hadoop security. This book
assumes a basic understanding of Hadoop (although the first chapter revisits many basic concepts), Kerberos,
relational databases, and Hive, plus an intermediate-level understanding of Linux.
How This Book Is Structured
The book is divided into five parts: Part I, “Introducing Hadoop and Its Security,” contains Chapters 1, 2, and 3; Part II,
“Authenticating and Authorizing Within Your Hadoop Cluster,” spans Chapters 4 and 5; Part III, “Audit Logging and
Security Monitoring,” houses Chapters 6 and 7; Part IV, “Encryption for Hadoop,” contains Chapter 8; and Part V holds
the four appendices.
Here’s a preview of each chapter in more detail:
Chapter 1, “Understanding Security Concepts,” offers an overview of security, the security
engineering framework, security protocols (including Kerberos), and possible security attacks.
This chapter also explains how to secure a distributed system and discusses Microsoft SQL
Server as an example of a secure system.
Chapter 2, “Introducing Hadoop,” introduces the Hadoop architecture and Hadoop
Distributed File System (HDFS), and explains the security issues inherent to HDFS and why
it’s easy to break into an HDFS installation. It also introduces Hadoop’s MapReduce framework
and discusses its security shortcomings. Last, it discusses the Hadoop Stack.
Chapter 3, “Introducing Hadoop Security,” serves as a roadmap to techniques for designing
and implementing security for Hadoop. It introduces authentication (using Kerberos) for
providing secure access, authorization to specify the level of access, and monitoring for
unauthorized access or unforeseen malicious attacks (using tools like Ganglia or Nagios).
You’ll also learn the importance of logging all access to Hadoop daemons (using the Log4j
logging system) and importance of data encryption (both in transit and at rest).
Chapter 4, “Open Source Authentication in Hadoop,” discusses how to secure your Hadoop
cluster using open source solutions. It starts by securing a client using PuTTY, then describes
the Kerberos architecture and details a Kerberos implementation for Hadoop step by step. In
addition, you’ll learn how to secure interprocess communication that uses the RPC (remote
procedure call) protocol, how to encrypt HTTP communication, and how to secure the data
communication that uses DTP (data transfer protocol).
Chapter 5, “Implementing Granular Authorization,” starts with ways to determine
security needs (based on application) and then examines methods to design fine-grained
authorization for applications. Directory- and file-level permissions are demonstrated using
a real-world example, and then the same example is re-implemented using HDFS Access
Control Lists and Apache Sentry with Hive.
Chapter 6, “Hadoop Logs: Relating and Interpretation,” discusses the use of logging for
security. After a high-level discussion of the Log4j API and how to use it for audit logging, the
chapter examines the Log4j logging levels and their purposes. You’ll learn how to correlate
Hadoop logs to implement security effectively, and get a look at Hadoop analytics and a possible
implementation using Splunk.
Chapter 7, “Monitoring in Hadoop,” discusses monitoring for security. It starts by discussing
features that a monitoring system needs, with an emphasis on monitoring distributed clusters.
Thereafter, it discusses the Hadoop metrics you can use for security purposes and examines
the use of Ganglia and Nagios, the two most popular monitoring applications for Hadoop. It
concludes by discussing some helpful plug-ins for Ganglia and Nagios that provide security-
related functionality and also discusses Ganglia integration with Nagios.
Chapter 8, “Encryption in Hadoop,” begins with some data encryption basics, discusses
popular encryption algorithms and their applications (certificates, keys, hash functions,
digital signatures), defines what can be encrypted for a Hadoop cluster, and lists some of the
popular vendor options for encryption. A detailed implementation of encryption for HDFS and Hive data at
rest follows, showing Intel’s distribution in action. The chapter concludes with a step-by-step
implementation of encryption at rest using an Elastic MapReduce (EMR) VM from Amazon Web
Services.
Downloading the Code
The source code for this book is available in ZIP file format in the Downloads section of the Apress web site
(www.apress.com).
Contacting the Author
You can reach Bhushan Lakhe at blakhe@aol.com or bclakhe@gmail.com.
PART I
Introducing Hadoop and Its Security
CHAPTER 1
Understanding Security Concepts
In today’s technology-driven world, computers have penetrated all walks of our life, and more of our personal and
corporate data is available electronically than ever. Unfortunately, the same technology that provides so many
benefits can also be used for destructive purposes. In recent years, individual hackers, who previously worked mostly
for personal gain, have organized into groups working for financial gain, making the threat of personal or corporate
data being stolen for unlawful purposes much more serious and real. Malware infests our computers and redirects
our browsers to specific advertising web sites depending on our browsing context. Phishing emails entice us to log
into web sites that appear real but are designed to steal our passwords. Viruses or direct attacks breach our networks
to steal passwords and data. As Big Data, analytics, and machine learning push into the modern enterprise, the
opportunities for critical data to be exposed and harm to be done rise exponentially.
If you want to counter these attacks on your personal property (yes, your data is your personal property) or your
corporate property, you have to understand thoroughly the threats as well as your own vulnerabilities. Only then can
you work toward devising a strategy to secure your data, be it personal or corporate.
Think about a scenario where your bank’s investment division uses Hadoop for analyzing terabytes of data and
your bank’s competitor has access to the results. Or how about a situation where your insurance company decides
to stop offering homeowner’s insurance based on Big Data analysis of millions of claims, and their competitor, who
has access (by stealth) to this data, finds out that most of the claims used as a basis for analysis were fraudulent? Can
you imagine how much these security breaches would cost the affected companies? Unfortunately, it is often only a
breach that highlights the importance of security. To its users, a good security setup—be it personal or corporate—is always
transparent.
This chapter lays the foundation on which you can begin to build that security strategy. I first define a security
engineering framework. Then I discuss some psychological aspects of security (the human factor) and introduce
security protocols. Last, I present common potential threats to a program’s security and explain how to counter
those threats, offering a detailed example of a secure distributed system. So, to start with, let me introduce you to the
concept of security engineering.
Introducing Security Engineering
Security engineering is about designing and implementing systems that do not leak private information and can
reliably withstand malicious attacks, errors, or mishaps. As a science, it focuses on the tools, processes, and methods
needed to design and implement complete systems and adapt existing systems.
Security engineering requires expertise that spans such dissimilar disciplines as cryptography, computer
security, computer networking, economics, applied psychology, and law. Software engineering skills (ranging from
business process analysis to implementation and testing) are also necessary, but are relevant mostly for countering
error and “mishaps”—not for malicious attacks. Designing systems to counter malice requires specialized skills and,
of course, specialized experience.
Security requirements vary from one system to another. Usually you need a balanced combination of user
authentication, authorization, policy definition, auditing, integral transactions, fault tolerance, encryption, and
isolation. A lot of systems fail because their designers focus on the wrong things, omit some of these factors, or
focus on the right things but do so inadequately. Securing Big Data systems with many components and interfaces
is particularly challenging. A traditional database has one catalog, and one interface: SQL connections. A Hadoop
system has many “catalogs” and many interfaces (Hadoop Distributed File System or HDFS, Hive, HBase). This
increased complexity, along with the varied and voluminous data in such a system, introduces many challenges for
security engineers.
Securing a system thus depends on several types of processes. To start with, you need to determine your security
requirements and then how to implement them. Also, you have to remember that secure systems have a very
important component in addition to their technical components: the human factor! That’s why you have to make sure
that people who are in charge of protecting the system and maintaining it are properly motivated. In the next section,
I define a framework for considering all these factors.
Security Engineering Framework
Good security engineering relies on the following five factors being considered while conceptualizing a system:
Strategy: Your strategy revolves around your objective. A specific objective is a good
starting point to define authentication, authorization, integral transactions, fault tolerance,
encryption, and isolation for your system. You also need to consider and account for possible
error conditions or malicious attack scenarios.
Implementation: Implementation of your strategy involves procuring the necessary hardware
and software components, designing and developing a system that satisfies all your objectives,
defining access controls, and thoroughly testing your system to match your strategy.
Reliability: Reliability is the degree of reliance you can place on each of your system
components and on your system as a whole. Reliability is measured against both failure and malfunction.
Relevance: Relevance is the ability of a system to counter the latest threats. For a security
system to remain relevant, it is extremely important to update it periodically so that it can
counter new threats as they arise.
Motivation: Motivation relates to the drive or dedication that the people responsible for
managing and maintaining your system have for doing their job properly, and also refers to
the lure for the attackers to try to defeat your strategy.
Figure 1-1 illustrates how these five factors interact.
Figure 1-1. Five factors to consider before designing a security framework (Strategy, Implementation, Reliability, Relevance, Motivation)
Notice the relationships, such as strategy for relevance, implementation of a strategy, implementation of
relevance, reliability of motivation, and so on.
Consider Figure 1-1’s framework through the lens of a real-world example. Suppose I am designing a system to
store the grades of high school students. How do these five key factors come into play?
With my objective in mind—create a student grading system—I first outline a strategy for the system. To begin,
I must define levels of authentication and authorization needed for students, staff, and school administrators (the
access policy). Clearly, students need to have only read permissions on their individual grades, staff needs to have
read and write permissions on their students’ grades, and school administrators need to have read permissions on
all student records. Any data update needs to be an integral transaction, meaning either it should complete all the
related changes or, if it aborts while in progress, then all the changes should be reverted. Because the data is sensitive,
it should be encrypted—students should be able to see only their own grades. The grading system should be isolated
within the school intranet using an internal firewall and should prompt for authentication when anyone tries to use it.
My strategy needs to be implemented by first procuring the necessary hardware (server, network cards) and
software components (SQL Server, C#, .NET components, Java). Next is design and development of a system to meet
the objectives by designing the process flow, data flow, logical data model, physical data model using SQL Server, and
graphical user interface using Java. I also need to define the access controls that determine who can access the system
and with what permissions (roles based on authorization needs). For example, I define the School_Admin role with
read permissions on all grades, the Staff role with read and write permissions, and so on. Last, I need to do a security
practices review of my hardware and software components before building the system.
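To make the role definitions concrete, here is a minimal sketch in Python of the access policy described above. The role names follow the example; the permission names and the helper function are hypothetical, and a real system would enforce these checks in both the application and the database layers.

    # Minimal sketch of the grading system's role-based access policy.
    # Permission names are illustrative, not from a real system.
    ROLE_PERMISSIONS = {
        "School_Admin": {"read_all_grades"},
        "Staff": {"read_student_grades", "write_student_grades"},
        "Student": {"read_own_grades"},
    }

    def is_allowed(role, permission):
        # Deny by default: an unknown role or permission grants nothing.
        return permission in ROLE_PERMISSIONS.get(role, set())

    assert is_allowed("Staff", "write_student_grades")
    assert not is_allowed("Student", "read_all_grades")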
While thoroughly testing the system, I can measure reliability by making sure that no one can access data they
are not supposed to, and also by making sure all users can access the data they are permitted to access. Any deviation
from this functionality makes the system unreliable. Also, the system needs to be available 24/7. If it’s not, then that
reduces the system’s reliability, too. This system’s relevance will depend on its impregnability. In other words, no
student (or outside hacker) should be able to hack through it using any of the latest techniques.
The system administrators in charge of managing this system (hardware, database, etc.) should be reliable and
motivated to have good professional integrity. Since they have access to all the sensitive data, they shouldn’t disclose
it to any unauthorized people (such as friends or relatives studying at the high school, any unscrupulous admissions
staff, or even the media). Laws against any such disclosures can be a good motivation in this case; but professional
integrity is just as important.
Psychological Aspects of Security Engineering
Why do you need to understand the psychological aspects of security engineering? The biggest threat to your online
security is deception: malicious attacks that exploit psychology along with technology. We’ve all received phishing
e-mails warning of some “problem” with a checking, credit card, or PayPal account and urging us to “fix” it by logging
into a cleverly disguised site designed to capture our usernames, passwords, or account numbers for unlawful
purposes. Pretexting is another common way for private investigators or con artists to steal information, be it personal
or corporate. It involves phoning someone (the victim who has the information) under a false pretext and getting the
confidential information (usually by pretending to be someone authorized to have that information). There have been
so many instances where a developer or system administrator got a call from the “security administrator” and was
asked for password information, supposedly for verification or security purposes. You’d think it wouldn’t work today,
but these instances are very common even now! It’s always best to ask for an e-mailed or written request for disclosure
of any confidential or sensitive information.
Companies use many countermeasures to combat phishing:
Password Scramblers: A number of browser plug-ins convert your password into a strong,
domain-specific password by hashing it (using a secret key) together with the domain name of the
web site being accessed. Even if you always use the same password, each web site you visit
will be provided with a different, unique password. Thus, if you mistakenly enter your Bank
of America password into a phishing site, the hacker gets an unusable variation of your real
password. (A sketch of this idea appears after this list.)
Client Certificates or Custom-Built Applications: Some banks provide their own laptops and
VPN access for using their custom applications to connect to their systems. They validate the
client’s use of their own hardware (e.g., through a media access control, or MAC address) and
also use VPN credentials to authenticate the user before letting him or her connect to their
systems. Some banks also provide client certificates to their users that are authenticated by
their servers; because they reside on client PCs, they can’t be accessed or used by hackers.
Two-Phase Authentication: With this system, logon involves both a token password and
a saved password. Security tokens generate a password (either for one-time use or time
based) in response to a challenge sent by the system you want to access. For example, every
few seconds a security token can display a new eight-digit password that’s synchronized
with the central server. After you enter the token password, the system then prompts for
a saved password that you set up earlier. This makes a stolen password useless to a hacker,
because the token password changes too quickly to be reused. Two-phase
authentication is still vulnerable to a real-time “man-in-the-middle” attack (see the
“Man-in-the-Middle Attack” sidebar for more detail).
MAN-IN-THE-MIDDLE ATTACK
A man-in-the-middle attack works by a hacker becoming an invisible relay (the “man in the middle”) between a
legitimate user and authenticator to capture information for illegal use. The hacker (or “phisherman”) captures the
user responses and relays them to the authenticator. He or she then relays any challenges from the authenticator
to the user, and any subsequent user responses to the authenticator. Because all responses pass through the
hacker, he is authenticated as a user instead of the real user, and hence is free to perform any illegal activities
while posing as a legitimate user!
For example, suppose a user wants to log in to his checking account and is enticed by a phishing scheme to
log into a phishing site instead. The phishing site simultaneously opens a logon session with the user’s bank.
When the bank sends a challenge, the phisherman relays it to the user, who uses his device to respond to it;
the phisherman relays this response to the bank, and is now authenticated to the bank as the user! After that,
of course, he can perform any illegal activities on that checking account, such as transferring all the money to his
own account.
Some banks counter this by using an authentication code based on the last amount withdrawn, the payee account
number, or a transaction sequence number as a response, instead of a simple response.
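To see why pure relaying defeats even a challenge-response scheme, consider this minimal Python sketch. The parties and the challenge format are simulated, not drawn from any real banking protocol.

    # All parties are simulated; the point is that the attacker never breaks
    # the cryptography, he only forwards messages between victim and bank.
    def bank_challenge():
        return "challenge-7391"                 # one-time challenge

    def user_respond(challenge):
        return "response-to-" + challenge       # the victim's token computes this

    def bank_verify(challenge, response):
        return response == "response-to-" + challenge

    challenge = bank_challenge()                # bank -> attacker (posing as user)
    response = user_respond(challenge)          # attacker -> victim -> attacker
    assert bank_verify(challenge, response)     # bank now trusts the attacker
    # Countermeasure (as noted above): bind the response to transaction details
    # such as payee and amount, so a relayed response is useless elsewhere.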
Trusted Computing: This approach involves installing a TPM (trusted platform module)
security chip on PC motherboards. TPM is a dedicated microprocessor that generates
cryptographic keys and uses them for encryption/decryption. Because localized hardware is
used for encryption, it is more secure than a software solution. To prevent any malicious code
from acquiring and using the keys, you need to ensure that the whole process of encryption/
decryption is done within TPM rather than TPM generating the keys and passing them to
external programs. Having such hardware transaction support integrated into the PC will
make it much more difficult for a hacker to break into the system. As an example, the recent
Heartbleed bug in OpenSSL would have been defeated by a TPM as the keys would not be
exposed in system memory and hence could not have been leaked.
Strong Password Protocols: Steve Bellovin and Michael Merritt came up with a series of
protocols for encrypted key exchange, whereby a key exchange is combined with a shared
password in such a way that a man in the middle (phisherman) can’t guess the password.
Various other researchers came up with similar protocols, and this technology was a precursor
to the “secure” (HTTPS) protocol we use today. Since HTTPS is more convenient, it was
implemented widely instead of the strong password protocols, which none of today’s browsers
implement.
Two-Channel Authentication: This involves sending one-time access codes to users via a
separate channel or a device (such as their mobile phone). This access code is used as an
additional password, along with the regular user password. This authentication is similar to
two-phase authentication and is also vulnerable to real-time man-in-the-middle attack.
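As promised above, here is a minimal Python sketch of the password-scrambler idea: the master password is hashed together with the site's domain, so a phishing domain yields a different, unusable password. Real plug-ins differ in the hash used, the handling of the secret key, and the output encoding; this is only an illustration.

    import base64
    import hashlib
    import hmac

    def domain_password(master_password, domain):
        # Derive a site-specific password from the master password and domain.
        digest = hmac.new(master_password.encode(), domain.encode(),
                          hashlib.sha256).digest()
        return base64.urlsafe_b64encode(digest)[:16].decode()

    # The same master password yields unrelated passwords per domain:
    print(domain_password("correct horse", "bankofamerica.com"))
    print(domain_password("correct horse", "bank0famerica.com"))  # phishing site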
Introduction to Security Protocols
A security system consists of components such as users, companies, and servers, which communicate using a number
of channels including phones, satellite links, and networks, while also using physical devices such as laptops, portable
USB drives, and so forth. Security protocols are the rules governing these communications and are designed to
effectively counter malicious attacks.
Since it is practically impossible to design a protocol that will counter all kinds of threats (besides being
expensive), protocols are designed to counter only certain types of threats. For example, the Kerberos protocol that’s
used for authentication assumes that the user is connecting to the correct server (and not a phishing web site) while
entering a name and password.
Protocols are often evaluated by considering the likelihood of the threat they are designed to
counter and their effectiveness in negating that threat.
Multiple protocols often have to work together in a large and complex system; hence, you need to take care
that the combination doesn’t open any vulnerabilities. I will introduce you to some commonly used protocols in the
following sections.
The Needham–Schroeder Symmetric Key Protocol
The Needham–Schroeder Symmetric Key Protocol establishes a session key between the requestor and authenticator
and uses that key throughout the session to make sure that the communication is secure. Let me use a quick example
to explain it.
A user needs to access a file from a secure file system. As a first step, the user requests a session key from the
authenticating server by providing her nonce (a random number or a serial number used to guarantee the freshness
of a message) and the name of the secure file system to which she needs access (step 1 in Figure 1-2). The server
replies with a session key, encrypted using the key shared between the server and the user; the reply also contains
the user’s nonce, confirming it’s not a replay. Last, the server provides the user with a copy of the session key
encrypted using the key shared between the server and the secure file system (step 2). The user forwards the key to the secure
file system, which can decrypt it using the key shared with the server, thus authenticating the session key (step 3). The
secure file system sends the user a nonce encrypted using the session key to show that it has the key (step 4). The user
performs a simple operation on the nonce, re-encrypts it, and sends it back, verifying that she is still alive and that she
holds the key. Thus, secure communication is established between the user and the secure file system.
The problem with this protocol is that the secure file system has to assume that the key it receives from the
authenticating server (via the user) is fresh. This may not be true. Also, if a hacker gets hold of the user’s key, he could
use it to set up session keys with many other principals. Last, it’s not possible for a user to revoke a session key in case
she discovers impersonation or improper use through usage logs.
To summarize, the Needham–Schroeder protocol is vulnerable to replay attack, because it’s not possible to
determine if the session key is fresh or recent.
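The flow in Figure 1-2 can be condensed into a short Python sketch. The encrypt/decrypt pair below only simulates symmetric encryption in order to show who can read what at each step; it is not real cryptography, and the principal names are made up.

    import os
    import secrets

    def encrypt(key, message):
        # Simulated symmetric encryption: pair the key with the plaintext.
        return ("ciphertext", key, message)

    def decrypt(key, ciphertext):
        _, used_key, message = ciphertext
        assert used_key == key, "wrong key"
        return message

    k_user = os.urandom(16)   # key shared by the user and the server
    k_fs = os.urandom(16)     # key shared by the file system and the server

    # Step 1: user -> server: her nonce and the target file system's name.
    nonce_a = secrets.token_hex(8)

    # Step 2: server -> user: session key plus the nonce, and a ticket for
    # the file system encrypted under the file system's key.
    k_session = os.urandom(16)
    reply = encrypt(k_user, (nonce_a, k_session,
                             encrypt(k_fs, ("alice", k_session))))

    # Step 3: user decrypts, checks her nonce, and forwards the ticket.
    echoed, k_session_user, ticket = decrypt(k_user, reply)
    assert echoed == nonce_a                  # the reply is not a replay
    _, k_session_fs = decrypt(k_fs, ticket)   # file system recovers the key

    # Steps 4-5: file system challenges; user transforms and returns the nonce.
    nonce_b = secrets.token_hex(8)
    challenge = encrypt(k_session_fs, nonce_b)
    response = encrypt(k_session_user, decrypt(k_session_user, challenge)[::-1])
    assert decrypt(k_session_fs, response) == nonce_b[::-1]
    # Nothing above proves the ticket itself is fresh; hence the replay weakness.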
Figure 1-2. Needham–Schroeder Symmetric Key Protocol (message flow among the user, the authenticating server, and
the secure file system: the user requests a session key with her nonce; the server replies with the session key and a ticket
for the file system; the user forwards the ticket; the file system responds with a nonce encrypted under the session key)
Kerberos
A derivative of the Needham–Schroeder protocol, Kerberos originated at MIT and is now used as a standard
authentication tool in Linux as well as Windows. Instead of a single trusted server, Kerberos uses two: an
authentication server that authenticates users to log in; and a ticket-granting server that provides tickets, allowing
access to various resources (e.g., files or secure processes). This provides more scalable access management.
What if a user needs to access a secure file system that uses Kerberos? First, the user logs on to the authentication
server using a password. The client software on the user’s PC fetches a ticket from this server that is encrypted
under the user’s password and that contains a session key (valid only for a predetermined duration like one hour or
one day). Assuming the user is authenticated, he now uses the session key to get access to the secure file system that’s
controlled by the ticket-granting server.
Next, the user requests access to the secure file system from the ticket-granting server. If the access is permissible
(depending on user’s rights), a ticket is created containing a suitable key and provided to the user. The user also gets
a copy of the key encrypted under the session key. The user now verifies the ticket by sending a timestamp to the
secure file system, which confirms it’s alive by sending back the timestamp incremented by 1 (this shows it was able to
decrypt the ticket correctly and extract the key). After that, the user can communicate with the secure file system.
Kerberos fixes the vulnerability of Needham–Schroeder by replacing random nonces with timestamps.
Of course, there is now a new vulnerability based on timestamps, in which clocks on various clients and servers
might be desynchronized deliberately as part of a more complex attack.
Kerberos is widely used and is incorporated into the Windows Active Directory server as its authentication
mechanism. In practice, Kerberos is the most widely used security protocol, and the other protocols discussed here
are mainly of historical importance. You will learn more about Kerberos in later chapters, as it is the primary
authentication mechanism used with Hadoop today.
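As a hedged illustration, here is a minimal sketch of a Kerberos login from Java using the JDK’s built-in JAAS support. The realm, KDC host, and the JAAS configuration entry name ("KrbClient") are all hypothetical; in practice these settings usually come from krb5.conf and a JAAS login configuration file that names com.sun.security.auth.module.Krb5LoginModule.

    import javax.security.auth.Subject;
    import javax.security.auth.login.LoginContext;
    import com.sun.security.auth.callback.TextCallbackHandler;

    public class KerberosLoginSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical realm and KDC; normally read from /etc/krb5.conf
            System.setProperty("java.security.krb5.realm", "EXAMPLE.COM");
            System.setProperty("java.security.krb5.kdc", "kdc.example.com");

            // "KrbClient" must be an entry in the JAAS configuration file that
            // specifies Krb5LoginModule; the callback handler prompts for the password
            LoginContext lc = new LoginContext("KrbClient", new TextCallbackHandler());
            lc.login(); // authenticates to the authentication server and obtains a ticket

            Subject subject = lc.getSubject();
            System.out.println("Authenticated as: " + subject.getPrincipals());
            // Tickets for specific services (e.g., the secure file system) are then
            // requested from the ticket-granting server, typically via the GSS-API
        }
    }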
Burrows–Abadi–Needham Logic
Burrows–Abadi–Needham (BAN) logic provides a framework for defining and analyzing sensitive information. The
underlying principle is that a message is authentic if it meets three criteria: it is encrypted with a relevant key, it’s from
a trusted source, and it is also fresh (that is, generated during the current run of the protocol). The verification steps
followed typically are to
Check if origin is trusted,
Check if encryption key is valid, and
Check timestamp to make sure it’s been generated recently.
Variants of BAN logic are used by some banks (e.g., the COPAC system used by Visa International). BAN logic is
very thorough, thanks to its multistep verification process, but that thoroughness is precisely why it’s not very popular:
it is complex to implement and also vulnerable to timestamp manipulation (just like Kerberos).
Consider a practical implementation of BAN logic. Suppose Mindy buys an expensive purse from a web retailer
and authorizes a payment of $400 to the retailer through her credit card. Mindy’s credit card company must be able
to verify and prove that the request really came from Mindy, if she should later disavow sending it. The credit card
company also wants to know that the request is entirely Mindy's, that it has not been altered along the way.
In addition, the company must be able to verify the encryption key (the three-digit security code from the credit card)
Mindy entered. Last, the company wants to be sure that the message is new—not a reuse of a previous message.
So, looking at the requirements, you can conclude that the credit card company needs to implement BAN logic.
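The three checks translate almost directly into code. Here is a minimal, hypothetical Java sketch; the helper collections and the five-minute freshness window are illustrative assumptions, not part of any real payment system.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Set;

    public class BanStyleVerifier {

        private final Set<String> trustedOrigins; // known-good senders
        private final Set<String> validKeyIds;    // keys we consider relevant
        private final Duration maxAge = Duration.ofMinutes(5); // freshness window

        public BanStyleVerifier(Set<String> trustedOrigins, Set<String> validKeyIds) {
            this.trustedOrigins = trustedOrigins;
            this.validKeyIds = validKeyIds;
        }

        // A message is treated as authentic only if all three BAN criteria hold
        public boolean isAuthentic(String origin, String keyId, Instant sentAt) {
            boolean trustedSource = trustedOrigins.contains(origin);       // trusted origin?
            boolean relevantKey = validKeyIds.contains(keyId);             // valid key?
            boolean fresh = !sentAt.isBefore(Instant.now().minus(maxAge)); // recent?
            return trustedSource && relevantKey && fresh;
        }
    }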
Now, having reviewed the protocols and ways they can be used to counter malicious attacks, do you think using a
strong security protocol (to secure a program) is enough to overcome any “flaws” in software (that can leave programs
open to security attacks)? Or is it like using an expensive lock to secure the front door of a house while leaving the
windows open? To answer that, you will first need to know what the flaws are or how they can cause security issues.
Securing a Program
Before you can secure a program, you need to understand what factors make a program insecure. To start with, using
security protocols only guards the door, or access to the program. Once the program starts executing, it needs to have
robust logic that will provide access to the necessary resources only, and not provide any way for malicious attacks
to modify system resources or gain control of the system. So, is this how a program can be free of flaws? Well, I will
discuss that briefly, but first let me define some important terms that will help you understand flaws and how to
counter them.
Let’s start with the term program. A program is any executable code. Even operating systems or database systems
are programs. I consider a program to be secure if it exactly (and only) does what it is supposed to do—nothing else!
An assessment of security may also be based on the program’s conformity to specifications—the code is secure
if it meets security requirements. Why is this important? Because when a program is executing, it has the capability to
modify your environment, and you have to make sure it only modifies what you want it to.
So, you need to consider the factors that will prevent a program from meeting the security requirements. These
factors can potentially be termed flaws in your program. A flaw can be either a fault or a failure.
A fault is an anomaly introduced in a system due to human error. A fault can be introduced at the design stage
due to the designer misinterpreting an analyst’s requirements, or at the implementation stage by a programmer not
understanding the designer’s intent and coding incorrectly. A single error can generate many faults. To summarize, a
fault is a logical issue or contradiction noticed by the designers or developers of the system after it is developed.
A failure is a deviation from required functionality for a system. A failure can be discovered during any phase of
the software development life cycle (SDLC), such as testing or operation. A single fault may result in multiple failures
(e.g., a design fault that causes a program to exit if no input is entered). If the functional requirements document
contains faults, a failure would indicate that the system is not performing as required (even though it may be
performing as specified). Thus, a failure is an apparent effect of a fault: an issue visible to the user(s).
Fortunately, not every fault results in a failure. For example, if the faulty part of the code is never executed or the
faulty part of logic is never entered, then the fault will never cause the code to fail—although you can never be sure
when a failure will expose that fault!
Broadly, the flaws can be categorized as:
Non-malicious (buffer overruns, validation errors etc.) and
Malicious (virus/worm attacks, malware etc.).
In the next sections, take a closer look at these flaws, the kinds of security breaches they may produce, and how to
devise a strategy to better secure your software to protect against such breaches.
Non-Malicious Flaws
Non-malicious flaws result from unintentional, inadvertent human errors. Most of these flaws only result in program
malfunctions. A few categories, however, have caused many security breaches in the recent past.
Buffer Overflows
A buffer (or array or string) is an allotted amount of memory (or RAM) where data is held temporarily for processing.
If the program data written to a buffer exceeds a buffer’s previously defined maximum size, that program data
essentially overflows the buffer area. Some compilers detect the buffer overrun and stop the program, while others
simply presume the overrun to be additional instructions and continue execution. If execution continues, the
program data may overwrite system data (because all program and data elements share the memory space with the
operating system and other code during execution). A hacker may spot the overrun and insert code in the system
space to gain control of the operating system with higher privileges.1
Several programming techniques are used to protect from buffer overruns, such as
Forced checks for buffer overrun;
Separation of system stack areas and user code areas;
Making memory pages either writable or executable, but not both; and
Monitors to alert if system stack is overwritten.
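As a point of contrast, here is a tiny Java illustration of the first technique (forced checks). The Java runtime always bounds-checks array accesses, so a would-be overrun raises an exception rather than silently overwriting adjacent memory, as unchecked native code might.

    public class BoundsCheckDemo {
        public static void main(String[] args) {
            byte[] buffer = new byte[16]; // valid indexes are 0 through 15
            try {
                buffer[16] = 0x41; // one element past the end
            } catch (ArrayIndexOutOfBoundsException e) {
                // The forced bounds check stops the overrun before any
                // system data could be overwritten
                System.out.println("Overrun blocked: " + e.getMessage());
            }
        }
    }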
Incomplete Mediation
Incomplete mediation occurs when a program accepts user data without validation or verification. Programs are
expected to check that the user data is within a specified range or follows a predefined format. When that is not
done, a hacker can manipulate the data for unlawful purposes. For example, if a web store doesn’t mediate user
data, a hacker may turn off any client JavaScript (used for validation) or just write a script to interact with the web
server (instead of using a web browser) and send arbitrary (unmediated) values to the server to manipulate a sale. In
some cases vulnerabilities of this nature are due to failure to check default configuration on components; a web server
that by default enables shell escape for XML data is a good example.
Another example of incomplete mediation is SQL Injection, where an attacker is able to insert (and submit)
a database SQL command (instead of or along with a parameter value) that is executed by a web application,
manipulating the back-end database. A SQL injection attack can occur when a web application accepts user-supplied
1Please refer to the IEEE paper “Beyond Stack Smashing: Recent Advances in Exploiting Buffer Overruns” by Jonathan Pincus
and Brandon Baker for more details on these kinds of attacks. A PDF of the article is available at http://classes.soe.ucsc.edu/
cmps223/Spring09/Pincus%2004.pdf.
input data without thorough validation. The cleverly formatted user data tricks the application into executing
unintended commands or modifying permissions to sensitive data. A hacker can get access to sensitive information
such as Social Security numbers, credit card numbers, or other financial data.
An example of SQL injection would be a web application that accepts the login name as input data and displays
all the information for a user, but doesn’t validate the input. Suppose the web application uses the following query:
"SELECT * FROM logins WHERE name ='" + LoginName + "';"
A malicious user can use a LoginName value of “' or '1'='1” which will result in the web application returning
login information for all the users (with passwords) to the malicious user.
If user input is validated against a set of defined rules for length, type, and syntax, SQL injection can be prevented.
Also, it is important to ensure that user permissions (for database access) are limited to the least privileges possible
(within the concerned database only) and that system administrator accounts, like sa, are never used for web
applications. Stored procedures that are not used should be removed, as they are easy targets for data manipulation.
Two key steps should be taken as a defense:
Server-based mediation must be performed. All client input needs to be validated by the
program (located on the server) before it is processed.
Client input needs to be checked for range validity (e.g., month is between January and
December) as well as allowed size (number of characters for text data, value limits for
numeric data, etc.).
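Here is a minimal sketch of both steps in Java using JDBC; the length and syntax rules are illustrative assumptions. The key point is the parameterized query: the driver transmits the login name as data, never as SQL text, so a value like ' or '1'='1 cannot change the statement’s structure.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class LoginLookup {

        public ResultSet findLogin(Connection conn, String loginName) throws Exception {
            // Server-side mediation: validate length, type, and syntax first
            if (loginName == null || loginName.length() > 30
                    || !loginName.matches("[A-Za-z0-9_]+")) {
                throw new IllegalArgumentException("Invalid login name");
            }
            // Parameterized query instead of string concatenation
            PreparedStatement ps =
                    conn.prepareStatement("SELECT * FROM logins WHERE name = ?");
            ps.setString(1, loginName);
            return ps.executeQuery();
        }
    }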
Time-of-Check to Time-of-Use Errors
Time-of-Check to Time-of-Use errors occur when a system’s state (or user-controlled data) changes between the check
for authorization for a particular task and execution of that task. That is, there is lack of synchronization or serialization
between the authorization and execution of tasks. For example, a user may request modification rights to an innocuous
log file and, between the check for authorization (for this operation) and the actual granting of modification rights, may
switch the log file for a critical system file (for example, /etc/passwd on the Linux operating system).
There are several ways to counter these errors:
Copy the requested user data to the system area, making subsequent modification
impossible.
Lock the request data until the requested action is complete.
Compute a checksum (using a validation routine) on the requested data to detect modification.
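The underlying race can be sketched in Java. In the racy version, the path is resolved twice—once at the check and once at the use—leaving a window for the file to be swapped. The safer version opens a handle once and performs every operation on that handle. File names here are hypothetical.

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class ToctouSketch {

        // Racy: the file the path points to can change between check and use
        static void racyAppend(Path log, String line) throws Exception {
            if (Files.isWritable(log)) {                          // time of check
                Files.write(log, line.getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.APPEND);               // time of use
            }
        }

        // Safer: lock onto one handle, so there is no second path resolution
        static void saferAppend(Path log, String line) throws Exception {
            try (FileChannel ch = FileChannel.open(log,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                ch.write(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)));
            }
        }
    }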
Malicious Flaws
Malicious flaws produce unanticipated or undesired effects in programs and are the result of code deliberately
designed to cause damage (corruption of data, system crash, etc.). Malicious flaws are caused by viruses, worms,
rabbits, Trojan horses, trap doors, and malware:
A virus is a self-replicating program that can modify uninfected programs by attaching a
copy of its malicious code to them. The infected programs turn into viruses themselves and
replicate further to infect the whole system. A transient virus depends on its host program
(the executable program of which it is part) and runs when its host executes, spreading itself
and performing the malicious activities for which it was designed. A resident virus resides in
a system’s memory and can execute as a stand-alone program, even after its host program
completes execution.
A worm, unlike a virus that uses other programs as its medium to spread, is a stand-alone
program that replicates through a network.
A rabbit is a virus or worm that self-replicates without limit and exhausts a computing
resource. For example, a rabbit might replicate itself to a disk unlimited times and fill up the
disk.
A Trojan horse is code with a hidden malicious purpose in addition to its primary purpose.
A logic trigger is malicious code that executes when a particular condition occurs (e.g., when
a file is accessed). A time trigger is a logic trigger with a specific time or date as its activating
condition.
A trap door is a secret entry point into a program that can allow someone to bypass normal
authentication and gain access. Trap doors have always been used by programmers for
legitimate purposes such as troubleshooting, debugging, or testing programs; but they
become threats when unscrupulous programmers use them to gain unauthorized access
or perform malicious activities. Malware can install malicious programs or trap doors on
Internet-connected computers. Once installed, trap doors can open an Internet port and
enable anonymous, malicious data collection, promote products (adware), or perform any
other destructive tasks as designed by their creator.
How do we prevent infections from malicious code?
Install only commercial software acquired from reliable, well-known vendors.
Track the versions and vulnerabilities of all installed open source components, and maintain
an open source component-security patching strategy.
Carefully check all default configurations for any installed software; do not assume the
defaults are set for secure operation.
Test any new software in isolation.
Open only “safe” attachments from known sources. Also, avoid opening attachments from
known sources that contain a strange or peculiar message.
Maintain a recoverable system image on a daily or weekly basis (as required).
Make and retain backup copies of executable system files as well as important personal data
that might contain “infectable” code.
Use antivirus programs and schedule daily or weekly scans as appropriate. Don’t forget to
update the virus definition files, as a lot of new viruses get created each day!
Securing a Distributed System
So far, we have examined potential threats to a program’s security, but remember—a distributed system is also a
program. Not only are all the threats and resolutions discussed in the previous section applicable to distributed
systems, but the special nature of these programs makes them vulnerable in other ways as well. That leads to a need to
have multilevel security for distributed systems.
When I think about a secure distributed system, ERP (enterprise resource planning) systems such as SAP or PeopleSoft
come to mind. Also, relational database systems such as Oracle, Microsoft SQL Server, or Sybase are good examples
of secure systems. All these systems are equipped with multiple layers of security and have been functional for a
long time. Over the years, they have seen a number of malicious attacks on stored data and have devised effective
countermeasures. To better understand what makes these systems safe, I will discuss how Microsoft SQL Server
secures sensitive employee salary data.
For a secure distributed system, data is hidden behind multiple layers of defenses (Figure 1-3). There are levels
such as authentication (using login name/password), authorization (roles with sets of permissions), encryption
(scrambling data using keys), and so on. For SQL Server, the first layer is a user authentication layer. Second is an
authorization check to ensure that the user has necessary authorization for accessing a database through database
role(s). Specifically, any connection to a SQL Server is authenticated by the server against the stored credentials.
If the authentication is successful, the server passes the connection through. When connected, the client inherits
the authorization assigned to the connected login by the system administrator. That authorization includes access to
any of the system or user databases with assigned roles (for each database). That is, a user can only access the databases
he is authorized to access—and within them, only the tables for which permissions have been assigned. At the database level, security is
further compartmentalized into table- and column-level security. When necessary, views are designed to further
segregate data and provide a more detailed level of security. Database roles are used to group security settings for a
group of tables.
Figure 1-3. SQL Server secures data with multiple levels of security
In Figure 1-3, the user who was authenticated and allowed to connect has been authorized to view employee data
in database DB1, except for the salary data (since he doesn’t belong to role HR and only users from Human Resources
have the HR role allocated to them). Access to sensitive data can thus be easily limited using roles in SQL Server.
Although the figure doesn’t illustrate them, more layers of security are possible, as you’ll learn in the next few sections.
Authentication
The first layer of security is authentication. SQL Server uses a login/password pair for authentication against stored
credential metadata. You can also use integrated security with Windows, and you can use a Windows login to
connect to SQL Server (assuming the system administrator has provided access to that login). Last, a certificate or
pair of asymmetric keys can be used for authentication. Useful features such as password policy enforcement (strong
password), date validity for a login, ability to block a login, and so forth are provided for added convenience.
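From a client’s point of view, the two modes look like this in Java (a hedged sketch using Microsoft’s JDBC driver; the server name, database, and credentials are hypothetical, and integrated security additionally requires the driver’s native authentication library on the client):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class SqlServerAuthSketch {
        public static void main(String[] args) throws Exception {
            // SQL Server authentication: a login/password pair checked against
            // the server's stored credential metadata
            try (Connection c1 = DriverManager.getConnection(
                    "jdbc:sqlserver://dbserver:1433;databaseName=DB1",
                    "appLogin", "S7rong!Passw0rd")) {
                System.out.println("Connected with a SQL Server login");
            }

            // Integrated security: the caller's Windows identity is used instead
            try (Connection c2 = DriverManager.getConnection(
                    "jdbc:sqlserver://dbserver:1433;databaseName=DB1;integratedSecurity=true")) {
                System.out.println("Connected with Windows authentication");
            }
        }
    }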
Authorization
The second layer is authorization. It is implemented by creating users corresponding to logins in the first layer
within various databases (on a server) as required. If a user doesn’t exist within a database, he or she doesn’t have
access to it.
Within a database, there are various objects such as tables (which hold the data), views (definitions for filtered
database access that may spread over a number of tables), stored procedures (scripts using the database scripting
language), and triggers (scripts that execute when an event occurs, such as an update of a column for a table or
inserting of a row of data for a table), and a user may have either read, modify, or execute permissions for these
objects. Also, in case of tables or views, it is possible to give partial data access (to some columns only) to users. This
provides flexibility and a very high level of granularity while configuring access.
Encryption
The third security layer is encryption. SQL Server provides two ways to encrypt your data: symmetric keys/certificates
and Transparent Database Encryption (TDE). Both these methods encrypt data “at rest” while it’s stored within a
database. SQL Server also has the capability to encrypt data in transit from client to server, by configuring corresponding
public and private certificates on the server and client to use an encrypted connection. Take a closer look:
Encryption using symmetric keys/certificate: A symmetric key is a sequence of binary or
hexadecimal characters that’s used along with an encryption algorithm to encrypt the data.
The server and client must use the same key for encryption as well as decryption. To enhance
the security further, a certificate containing a public and private key pair can be required. The
client application must have this pair available for decryption. The real advantage of using
certificates and symmetric keys for encryption is the granularity it provides. For example,
you can encrypt only a single column from a single table (Figure 1-4)—no need to encrypt
the whole table or database (as with TDE). Encryption and decryption are CPU-intensive
operations and take up valuable processing resources. That also makes retrieval of encrypted
data slower as compared to unencrypted data. Last, encrypted data needs more storage. Thus
it makes sense to use this option if only a small part of your database contains sensitive data.
Figure 1-4. Creating column-level encryption using symmetric keys and certificates (all within the same user database: create a database master key, create a certificate, create a symmetric key that uses the certificate for encryption, and encrypt the desired column(s) of any tables using the symmetric key; decryption is performed by opening the symmetric key, which uses the certificate for decryption, and since only authorized users have access to the certificate, access to encrypted data is restricted)
TDE: TDE is the mechanism SQL Server provides to encrypt a database completely using
symmetric keys and certificates. Once database encryption is enabled, all the data within
a database is encrypted while it is stored on the disk. This encryption is transparent to
any clients requesting the data, because data is automatically decrypted when it is
transferred from disk to the buffers. Figure 1-5 details the steps for implementing TDE
for a database.
Figure 1-5. Process for implementing TDE for a SQL Server database (create a database master key and a certificate in the master database; then, in the user database where TDE needs to be enabled, create a database encryption key using that certificate; finally, enable encryption for the database)
Using encrypted connections: This option involves encrypting client connections to a SQL
Server and ensures that the data in transit is encrypted. On the server side, you must configure
the server to accept encrypted connections, create a certificate, and export it to the client that
needs to use encryption. The client’s user must then install the exported certificate on the
client, configure the client to request an encrypted connection, and open up an encrypted
connection to the server.
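On the client side, requesting the encrypted connection is a matter of connection properties. A hedged Java sketch with Microsoft’s JDBC driver follows; the host, credentials, and trust-store path are hypothetical, and the trust store is assumed to already contain the certificate exported from the server.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class EncryptedConnectionSketch {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://dbserver:1433;databaseName=DB1;"
                    + "encrypt=true;trustServerCertificate=false;" // demand TLS, verify the server
                    + "trustStore=/etc/pki/client-trust.jks;"
                    + "trustStorePassword=changeit";
            try (Connection conn = DriverManager.getConnection(url,
                    "appLogin", "S7rong!Passw0rd")) {
                System.out.println("Data in transit is now encrypted");
            }
        }
    }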
Figure 1-6 maps the various levels of SQL Server security. As you can see, data can be filtered (as required) at
every stage of access, providing granularity for user authorization.
Figure 1-6. SQL Server security layers with details (first line of defense: server-level authentication, which needs a valid login/password or a valid Windows login authenticated via Active Directory; a SQL Server login can be mapped to a Windows AD login, a certificate, or an asymmetric key. Second line of defense: database-level authorization, which needs a valid database user or role with authorization to the requested database object, such as a table, view, or stored procedure; again, the user may be mapped to a Windows AD or SQL Server login, certificate, or asymmetric key, and can be part of a predefined database or application role that provides a subset of permissions—for example, the db_datareader role provides Read permission for all user-defined tables in a database. Third line of defense: optional database encryption, where you can encrypt data at the column, table, or database level, depending on its sensitivity)
Hadoop is also a distributed system and can benefit from many of the principles you learned here. In the next
two chapters, I will introduce Hadoop and give an overview of Hadoop’s security architecture (or the lack of it).
Summary
This chapter introduced general security concepts to help you better understand and appreciate the various
techniques you will use to secure Hadoop. Remember, however, that the psychological aspects of security are as
important to understand as the technology. No security protocol can help you if you readily provide your password
to a hacker!
Securing a program requires knowledge of potential flaws so that you can counter them. Non-malicious flaws
can be reduced or eliminated using quality control at each phase of the SDLC and extensive testing during the
implementation phase. Specialized antivirus software and procedural discipline are the only solutions for
malicious flaws.
A distributed system needs multilevel security due to its architecture, which spreads data on multiple hosts and
modifies it through numerous processes that execute at a number of locations. So it’s important to design security
that will work at multiple levels and to secure various hosts within a system depending on their role (e.g., security
required for the central or master host will be different compared to other hosts). Most of the time, these levels are
authentication, authorization, and encryption.
Last, the computing world is changing rapidly and new threats evolve on a daily basis. It is important to design
a secure system, but it is equally important to keep it up to date. A security system that was best until yesterday is not
good enough. It has to be the best today—and possibly tomorrow!
CHAPTER 2
Introducing Hadoop
I was at a data warehousing conference and talking with a top executive from a leading bank about Hadoop. As I was
telling him about the technology, he interjected, “But does it have any use for us? We don’t have any Internet usage
to analyze!” Well, he was just voicing a common misconception. Hadoop is not a technology meant for analyzing web
usage or log files only; it has a genuine use in the world of petabytes (of 1,000 terabytes apiece). It is a super-clever
technology that can help you manage very large volumes of data efficiently and quickly—without spending a fortune
on hardware.
Hadoop may have started in laboratories with some really smart people using it to analyze data for behavioral
purposes, but it is increasingly finding support today in the corporate world. There are some changes it needs to
undergo to survive in this new environment (such as added security), but with those additions, more and more
companies are realizing the benefits it offers for managing and processing very large data volumes.
For example, the Ford Motor Company uses Big Data technology to process the large amount of data generated
by their hybrid cars (about 25GB per hour), analyzing, summarizing, and presenting it to the driver via a mobile
app that provides information about the car’s performance, the nearest charging station, and so on. Using Big Data
solutions, Ford also analyzes the data available on social media through consumer feedback and comments about
their cars. It wouldn’t be possible to use conventional data management and analysis tools to analyze such large
volumes of diverse data.
The social networking site LinkedIn uses Hadoop along with custom-developed distributed databases, called
Voldemort and Espresso, to manage its voluminous data, enabling it to provide popular features such as
“People you might know” lists or the LinkedIn social graph at great speed in response to a single click. This wouldn’t
have been possible with conventional databases or storage.
Hadoop’s use of low-cost commodity hardware and built-in redundancy are major factors that make it attractive
to most companies using it for storage or archiving. In addition, features such as distributed processing (which
multiplies your processing power by the number of nodes), the capability to handle petabytes of data with ease,
capacity expansion without downtime, and a high degree of fault tolerance make Hadoop an attractive proposition for an increasing
number of corporate users.
In the next few sections, you will learn about Hadoop architecture, the Hadoop stack, and also about the security
issues that Hadoop architecture inherently creates. Please note that I will only discuss these security issues briefly in
this chapter; Chapter 4 contains a more detailed discussion about these issues, as well as possible solutions.
Hadoop Architecture
The hadoop.apache.org web site defines Hadoop as “a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models.” Quite simply, that’s the philosophy: to
provide a framework that’s simple to use, can be scaled easily, and provides fault tolerance and high availability for
production usage.
The idea is to use existing low-cost hardware to build a powerful system that can process petabytes of data very
efficiently and quickly. Hadoop achieves this by storing the data locally on its DataNodes and processing it locally as
well. All this is managed efficiently by the NameNode, which is the brain of the Hadoop system. All client applications
locate data through the NameNode (the actual blocks are then read from or written to DataNodes directly), as you can
see in Figure 2-1’s simplistic Hadoop cluster.
Figure 2-1. Simple Hadoop cluster with NameNode (the brain of the system) and DataNodes (the workers or limbs of the system) for data storage
Hadoop has two main components: the Hadoop Distributed File System (HDFS) and a framework for processing
large amounts of data in parallel using the MapReduce paradigm. Let me introduce you to HDFS first.
HDFS
HDFS is a distributed file system layer that sits on top of the native file system for an operating system. For example,
HDFS can be installed on top of ext3, ext4, or XFS file systems for the Ubuntu operating system. It provides redundant
storage for massive amounts of data using cheap, unreliable hardware. At load time, data is distributed across all the
nodes. That helps in efficient MapReduce processing. HDFS performs better with a few large files (multi-gigabytes) as
compared to a large number of small files, due to the way it is designed.
Files are “write once, read multiple times.” Append support is now available in newer versions, but
HDFS is meant for large, streaming reads—not random access. High sustained throughput is favored over low latency.
Files in HDFS are stored as blocks and replicated for redundancy and reliability. By default, blocks are replicated
thrice across DataNodes, so three copies of every file are maintained. Also, the block size is much larger than in other
file systems. For example, NTFS (for Windows) typically uses a 4KB block (cluster) size, and Linux ext3 defaults to 4KB.
Compare that with the default block size of 64MB that HDFS uses!
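A short sketch of the write-once, read-many pattern through Hadoop’s Java API follows; the NameNode URI and the file path are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // NameNode address
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/jdoe/notes.txt");
            try (FSDataOutputStream out = fs.create(file)) {  // write once...
                out.writeUTF("HDFS favors large, streaming reads");
            }
            try (FSDataInputStream in = fs.open(file)) {      // ...read many times
                System.out.println(in.readUTF());
            }
        }
    }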
NameNode
NameNode (or the “brain”) stores metadata and coordinates access to HDFS. Metadata is stored in NameNode’s
RAM for speedy retrieval and reduces the response time (for NameNode) while providing addresses of data blocks.
This configuration provides simple, centralized management—and also a single point of failure (SPOF) for HDFS. In
previous versions, a Secondary NameNode provided recovery from NameNode failure; the current version, however,
provides the capability to run a Hot Standby node (which takes over all the functions of the NameNode without
any user intervention) in an Active/Passive configuration, eliminating the SPOF and providing
NameNode redundancy.
Since the metadata is stored in NameNode’s RAM and each entry for a file (with its block locations) takes some
space, a large number of small files will result in a lot of entries and take up more RAM than a small number of entries
for large files. Also, each file smaller than the block size (64MB by default) is still mapped to its own block and its own
metadata entry; that’s the reason it’s preferable to use HDFS for large files instead of small files.
Figure 2-2 illustrates the relationship between the components of an HDFS cluster.
Figure 2-2. HDFS cluster with its components. The NameNode holds metadata only: a map of each filename to its blocks and the DataNodes where each block resides (for example, /usr/JonDoe/File1.txt: block 1 on DataNodes 1 and 2, block 4 on DataNodes 1 and 3; /usr/JaneDoe/File2.txt: block 2 on DataNodes 1 and 3, block 3 on DataNodes 2 and 3, block 5 on DataNodes 2 and 3). The DataNodes hold the actual data blocks; please observe the two replicated copies of each data block spread over multiple DataNodes
HDFS File Storage and Block Replication
The HDFS file storage and replication system is significant for its built-in intelligence of block placement, which offers
a better recovery from node failures. When NameNode processes a file storage request (from a client), it stores the first
copy of a block locally on the client—if it’s part of the cluster. If not, then NameNode stores it on a DataNode that’s not
too full or busy. It stores the second copy of the block on a different DataNode residing on the same rack (yes, HDFS
considers rack usage for DataNodes while deciding block placement) and third on a DataNode residing on a different
rack, just to reduce risk of complete data loss due to a rack failure. Figure 2-2 illustrates how two replicas (of each
block) for the two files are spread over available DataNodes.
DataNodes send heartbeats to NameNode, and if a DataNode doesn’t send heartbeats for a particular duration,
it is assumed to be “lost.” NameNode finds other DataNodes (with a copy of the blocks located on that DataNode) and
instructs them to make a fresh copy of the lost blocks to another DataNode. This way, the total number of replicas for
all the blocks would always match the configured replication factor (which decides how many copies of a file will be
maintained).
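The replication factor is just configuration. A minimal Java sketch follows (values and paths are illustrative; 3 is the HDFS default):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "3"); // default replication for new files
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor for a single critical file only
            fs.setReplication(new Path("/user/jdoe/critical.dat"), (short) 5);
        }
    }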
Adding or Removing DataNodes
It is surprisingly easy to add or remove DataNodes from a HDFS cluster. You just need to add the hostname for the
new DataNode to a configuration file (a text file named slaves) and run an administrative utility to tell NameNode
about this addition. After that, the DataNode process is started on the new DataNode and your HDFS cluster has an
additional DataNode.
DataNode removal is equally easy and just involves a reverse process—remove the hostname entry from slaves
and run the administrative utility to make NameNode aware of this deletion. After this, the DataNode process can
be shut down on that node and removed from the HDFS cluster. NameNode quietly replicates the blocks (from
decommissioned DataNode) to other DataNodes, and life moves on.
Cluster Rebalancing
Adding or removing DataNodes is easy, but it may result in your HDFS cluster becoming unbalanced, as may other
activities within your cluster. Hadoop provides a utility (the Hadoop Balancer) that will balance your cluster again.
The Balancer moves blocks from overutilized DataNodes to underutilized ones, while still following Hadoop’s storage
and replication policy of not having all the replicas on DataNodes located on a single rack.
Block movement continues until utilization (the ratio of used space to total capacity) for all the DataNodes is within
a threshold percentage of each other. For example, a 5% threshold means utilization for all DataNodes is within 5% of
each other.
The balancer runs in the background with a low bandwidth without taxing the cluster.
Disk Storage
HDFS uses local storage for NameNode, Secondary NameNode, and DataNodes, so it’s important to use the correct
storage type. NameNode, being the brain of the cluster, needs to have redundant and fault-tolerant storage. Using
RAID 10 (striping and mirroring your data across at least four disks) is highly recommended. The Secondary NameNode
needs to have RAID 10 storage. As far as the DataNodes are concerned, they can use local JBOD (just a bunch of disks)
storage. Remember, data on these nodes is already replicated thrice (or whatever the replication factor is), so there is
no real need for using RAID drives.
Secondary NameNode
Let’s now consider how Secondary NameNode maintains a standby copy of NameNode metadata. The NameNode
uses an image file called fsimage to store the current state of HDFS (a map of all files stored within the file system and
locations of their corresponding blocks) and a file called edits to store modifications to HDFS. With time, the edits
file can grow very large; as a result, the fsimage wouldn’t have an up-to-date image that correctly reflects the state of
HDFS. In such a situation, if the NameNode crashes, the current state of HDFS will be lost and the data unusable.
To avoid this, the Secondary NameNode performs a checkpoint (every hour by default), merges the fsimage and
edits files from NameNode locally, and copies the result back to the NameNode. So, in a worst-case scenario, only the
edits or modifications made to HDFS will be lost—since the Secondary NameNode stores the latest copy of fsimage
locally. Figure 2-3 provides more insight into this process.
Figure 2-3. Checkpoint performed by Secondary NameNode (the Secondary NameNode copies fsimage and edits from the NameNode, applies the edits to fsimage and generates a new fsimage locally, and copies the fresh fsimage back to the NameNode, which then creates a new edits file; the NameNode is the brain of the system, so it is very important to have a backup of its metadata)
What does all this mean for your data? Consider how HDFS processes a request. Figure 2-4 shows how a data
request is addressed by NameNode and data is retrieved from corresponding DataNodes.
Figure 2-4. Anatomy of a Hadoop data access request (1: the Hadoop client requests the file GiveMeData.txt, which is stored as data blocks; 2: the NameNode, holding metadata that maps each block to a DataNode, provides the addresses of the first and subsequent data blocks; 3: the client contacts the DataNodes holding the actual data blocks and retrieves them. Data retrieval continues, with the NameNode providing addresses of subsequent blocks on the appropriate DataNodes, until the whole file is retrieved)
NameNode High Availability
As you remember from the NameNode section, NameNode is a SPOF. But if a Hadoop cluster is used as a production
system, there needs to be a way to eliminate this dependency and make sure that the cluster will work normally even
in case of NameNode failure. One of the ways to counter NameNode failure is using NameNode high availability (or
HA), where a cluster is deployed with an active/passive pair of NameNodes. The edits write-ahead log needs to be
available for both NameNodes (active/passive) and hence is located on a shared NFS directory. The active NameNode
writes to the edits log and the standby NameNode replays the same transactions to ensure it is up to date (to be ready
to take over in case of a failure). DataNodes send block reports to both the nodes.
You can configure an HA NameNode pair for manual or automatic failover (active and passive nodes
interchanging roles). For manual failover, a command needs to be executed to have the Standby NameNode take over
as the Primary or active NameNode. For automatic failover, each NameNode needs to run an additional process called
a failover controller, which monitors the NameNode processes and coordinates the state transition as required. The
application ZooKeeper is often used to manage failovers.
During a failover, the standby NameNode can’t determine whether the active NameNode is really down or merely
inaccessible from the standby. If both NameNode processes run in parallel, they can both write to the shared state and
corrupt the file system metadata. This constitutes a split-brain scenario, and to avoid this situation, you need to ensure
that the failed NameNode is stopped or “fenced.” Increasingly severe techniques are used to implement fencing,
starting with a stop request via RPC (remote procedure call) and escalating to STONITH (“shoot the other node in the head”),
implemented by issuing a reboot remotely or (programmatically) cutting power to a machine for a short duration.
When using HA, since the standby NameNode takes over the role of the Secondary NameNode, no separate
Secondary NameNode process is necessary.
Inherent Security Issues with HDFS Architecture
After reviewing HDFS architecture, you can see that this is not the traditional client/server model of processing data
we are all used to. There is no server to process the data, authenticate the users, or manage locking. There was no
security gateway or authentication mechanism in the original Hadoop design. Although Hadoop now has strong
authentication built in (as you shall see later), the complexity of integrating with existing corporate systems and
providing role-based authorization still presents challenges.
Any user with access to the server running NameNode processes and having execute permissions to the Hadoop
binaries can potentially request data from NameNode and request deletion of that data, too! Access is limited only by
Hadoop directory and file permissions; but it’s easy to impersonate another user (in this case a Hadoop superuser)
and access everything. Moreover, Hadoop doesn’t enable you to provide role-based access or object-level access, or
offer enough granularity for attribute-level access (for a particular object). For example, it doesn’t offer special roles
with ability to run specific Hadoop daemons (or services). There is an all-powerful Hadoop superuser in the admin
role, but everyone else is a mere mortal. Users simply have access to connect to HDFS and access all files, unless file
access permissions are specified for specific owners or groups.
Therefore, the flexibility that Hadoop architecture provides also creates vulnerabilities due to lack of a central
authentication mechanism. Because data is spread across a large number of DataNodes, along with the advantages
of distributed storage and processing, the DataNodes also serve as potential entry points for attacks and need to be
secured well.
Hadoop clients perform metadata operations, such as creating and opening files, at the NameNode using the
RPC protocol, and read/write the data of a file directly from DataNodes using a streaming socket protocol called
the data-transfer protocol. Communication via the RPC protocol can easily be encrypted through Hadoop
configuration files, but encrypting the data traffic between DataNodes and clients requires use of the Kerberos or SASL
(Simple Authentication and Security Layer) framework.
The HTTP communication between web consoles and Hadoop daemons (NameNode, Secondary NameNode,
DataNode, etc.) is unencrypted and unsecured (it allows access without any form of authentication by default), as
seen in Figure 2-5. So, it’s very easy to access all the cluster metadata. To summarize, the following threats exist for
HDFS due to its architecture:
An unauthorized client may access an HDFS file or cluster metadata via the RPC or HTTP
protocols (since the communication is unencrypted and unsecured by default).
An unauthorized client may read/write a data block of a file at a DataNode via the pipeline
streaming data-transfer protocol (again, unencrypted communication).
A task or node may masquerade as a Hadoop service component (such as DataNode) and
modify the metadata or perform destructive activities.
A malicious user with network access could intercept unencrypted internode
communications.
Data on failed disks in a large Hadoop cluster can leak private information if not handled
properly.
Figure 2-5. Hadoop communication protocols and vulnerabilities (the client connects to the NameNode over the unsecured RPC protocol—the NameNode authorizes using file ACLs only, so it’s easily possible to connect as another user—and to the DataNodes over the unsecured data transfer protocol, which has no ACL for authorization, making unauthorized access to data blocks possible; web console communication is likewise unsecured, and it is possible for a rogue process to masquerade as a DataNode)
When Hadoop daemons (or services) communicate with each other, they don’t verify that the other service is
really what it claims to be. So, it’s easily possible to start a rogue TaskTracker to get access to data blocks. There are
ways to have Hadoop services perform mutual authentication; but Hadoop doesn’t implement them by default and
they need configuration changes as well as some additional components to be installed. Figure 2-5 summarizes
these threats.
We will revisit the security issues in greater detail (with pertinent solutions) in Chapters 4 and 5 (which cover
authentication and authorization) and Chapter 8 (which focuses on encryption). For now, turn your attention to the
other major Hadoop component: the framework for processing large amounts of data in parallel using MapReduce
paradigm.
Hadoop’s Job Framework using MapReduce
In earlier sections, we reviewed one aspect of Hadoop: HDFS, which is responsible for distributing (and storing) data
across multiple DataNodes. The other aspect is distributed processing of that data; this is handled by Hadoop’s job
framework, which uses MapReduce.
MapReduce is a method for distributing a task across multiple nodes. Each node processes data stored on that
node (where possible). It consists of two phases: Map and Reduce. The Map task works on a split or part of input
data (a key-value pair), transforms it, and outputs the transformed intermediate data. Then there is a data exchange
between nodes in a shuffle (sorting) process, and intermediate data of the same key goes to the same Reducer.
When a Reducer receives output from various mappers, it sorts the incoming data using the key (of the
key-value pair) and groups together all values for the same key. The reduce method is then invoked (by the Reducer).
It generates a (possibly empty) list of key-value pairs by iterating over the values associated with a given key and writes
output to an output file.
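The canonical illustration of this paradigm is word counting. A condensed version of the standard Hadoop example follows (the job-driver setup is omitted for brevity): the Map task emits a (word, 1) pair for every word it sees, and the Reduce task sums the values grouped under each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE); // intermediate key-value pair
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get(); // all values shuffled to this key
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }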
The MapReduce framework utilizes two Hadoop daemons (JobTracker and TaskTracker) to schedule and
process MapReduce jobs. The JobTracker runs on the master node (usually the same node that’s running NameNode)
and manages all jobs submitted for a Hadoop cluster. A JobTracker uses a number of TaskTrackers on slave nodes
(DataNodes) to process parts of a job as required.
A task attempt is an instance of a task running on a slave (TaskTracker) node. Task attempts can fail, in which case
they will be restarted. Thus there will be at least as many task attempts as there are tasks that need to be performed.
To summarize, a MapReduce program involves the following steps:
The client program submits a job (data request) to Hadoop.
The job consists of a mapper, a reducer, and a list of inputs.
The job is sent to the JobTracker process on the master node.
Each slave node runs a process called the TaskTracker.
The JobTracker instructs TaskTrackers to run and monitor tasks (a Map or Reduce task for
input data).
Figure 2-6 illustrates Hadoop’s MapReduce framework and how it processes a job.

Figure 2-6. MapReduce framework and job processing (1: the client submits a job to the JobTracker; 2: the JobTracker directs TaskTrackers to run Map tasks; 3: the TaskTrackers produce Map outputs; 4: the Reduce step combines them into the job output)
Task processes send heartbeats to the TaskTracker. TaskTrackers send heartbeats to the JobTracker. Any task that
fails to report in 10 minutes is assumed to have failed and is killed by the TaskTracker. Also, any task that throws an
exception is said to have failed.
Failed tasks are reported to the JobTracker by the TaskTracker. The JobTracker reschedules any failed tasks
and tries to avoid rescheduling the task on the same TaskTracker where it previously failed. If a task fails more than
four times, the whole job fails. Any TaskTracker that fails to report in 10 minutes is assumed to have crashed and all
assigned tasks restart on another TaskTracker node.
Any TaskTracker reporting a high number of failed tasks is blacklisted (to prevent the node from blocking the
entire job). There is also a global blacklist for TaskTrackers that fail on multiple jobs. The JobTracker manages the state
of each job and partial results of failed tasks are ignored.
Figure 2-7 shows how the MapReduce paradigm works for input key-value pairs and results in a reduced output.
Figure 2-7. MapReduce processing for a job (input records to a job become key/value pairs; intermediate job output is shuffled between nodes; output records emerge from the job)
Detailed coverage of MapReduce is beyond the scope of this book; interested readers can refer to Pro Hadoop by
Jason Venner (Apress, 2009). Jason introduces MapReduce in Chapter 2 and discusses the anatomy of a MapReduce
program at length in Chapter 5. Each of the components of MapReduce is discussed in great detail, offering an
in-depth understanding.
Apache Hadoop YARN
The MapReduce algorithm used by earlier versions of Hadoop wasn’t sufficient in many cases for scenarios where
customized resource handling was required. With YARN, Hadoop now has a generic distributed data processing
framework (with a built-in scheduler) that can be used to define your own resource handling. Hadoop MapReduce is
now just one of the distributed data processing applications that can be used with YARN.
YARN allocates the two major functionalities of the JobTracker (resource management and job scheduling/
monitoring) to separate daemons: a global ResourceManager and a per-application ApplicationMaster. The
ResourceManager and NodeManager (which runs on each “slave” node) form a generic distributed data processing
system in conjunction with the ApplicationMaster.
ResourceManager is the overall authority that allocates resources for all the distributed data processing
applications within a cluster. ResourceManager uses a pluggable Scheduler (of your choice—e.g., Fair or first-in, first-
out [FIFO] scheduler) that is responsible for allocating resources to various applications based on their need. This
Scheduler doesn’t perform monitoring, track status, or restart failed tasks.
The per-application ApplicationMaster negotiates resources from the ResourceManager, works with the
NodeManager(s) to execute the component tasks, tracks their status, and monitors their progress. This functionality
was performed earlier by TaskTracker (plus the scheduling, of course).
The NodeManager is responsible for launching the applications’ containers, monitoring their resource usage
(CPU, memory, disk, network), and reporting it to the ResourceManager.
So, what are the differences between MapReduce and YARN? As cited earlier, YARN splits the JobTracker
functionalities between the ResourceManager (resource management and scheduling) and the ApplicationMaster
(per-application job management and monitoring). Interestingly, that
also moves all the application-framework-specific code to ApplicationMaster, generalizing the system so that multiple
distributed processing frameworks such as MapReduce, MPI (Message Passing Interface, a message-passing system
for parallel computers, used in development of many scalable large-scale parallel applications) and Graph Processing
can be supported.
Inherent Security Issues with Hadoop’s Job Framework
The security issues with the MapReduce framework revolve around the lack of authentication within Hadoop, the
communication between Hadoop daemons being unsecured, and the fact that Hadoop daemons do not authenticate
each other. The main security concerns are as follows:
An unauthorized user may submit a job to a queue or delete or change priority of the job
(since Hadoop doesn’t authenticate or authorize and it’s easy to impersonate a user).
An unauthorized client may access the intermediate data of a Map job via its TaskTracker’s
HTTP shuffle protocol (which is unencrypted and unsecured).
An executing task may use the host operating system interfaces to access other tasks and local
data, which includes intermediate Map output or the local storage of the DataNode that runs
on the same physical node (data at rest is unencrypted).
A task or node may masquerade as a Hadoop service component such as a DataNode,
NameNode, JobTracker, TaskTracker, etc. (no host process authentication).
A user may submit a workflow (using a workflow package like Oozie) as another user (it’s easy
to impersonate a user).
As you remember, Figure 2-6 illustrated how the MapReduce framework processes a job. Comparing Figure 2-6
with Figure 2-8 will give you a better insight into the security issues with the MapReduce framework. Figure 2-8 details
the security issues in the same context: job execution.
Figure 2-8. MapReduce framework vulnerabilities (a possibly unauthorized user may submit a job to the JobTracker; a client may access intermediate Map output data; a rogue process might masquerade as a Hadoop component; and one TaskTracker may access intermediate data produced by another TaskTracker)
Hadoop’s Operational Security Woes
The security issues discussed so far stem from Hadoop’s architecture and are not operational issues that we have to
deal with on a daily basis. Some issues arise from Hadoop’s relative newness and origins in isolated laboratories with
insulated, secure environments. There was a time when Hadoop was “secured” by severely restricting network access
to Hadoop clusters. Any access request had to be accompanied by several waivers from security departments and the
requestor’s own management hierarchy!
Also, some existing technologies have not had time to build interfaces or provide gateways to integrate with
Hadoop. For example, a few features that are missing right now may even have been added by the time you read this.
Like Unix of yesteryear, Hadoop is still a work in progress and new features as well as new technologies are added on a
daily basis. With that in mind, consider some operational security challenges that Hadoop currently has.
Inability to Use Existing User Credentials and Policies
Suppose your organization uses single sign-on or Active Directory domain accounts for connecting to the various
applications used. How can you use them with Hadoop? Well, Hadoop does offer LDAP (Lightweight Directory Access
Protocol) integration, but configuring it is not easy, as this interface is still in a nascent stage and documentation is
extremely sketchy (in some cases there is no documentation). The situation is compounded by Hadoop being used
on a variety of Linux flavors, with issues varying by the operating system used and its version. Hence, allocating selective
Hadoop resources to Active Directory users is not always possible.
Also, how can you enforce existing access control policies such as read access for application users, read/
write for developers, and so forth? The answer is that you can’t. The easiest way is to create separate credentials for
Hadoop access and reestablish access control manually, following the organizational policies. Hadoop follows its own
model for security, which is similar (in appearance) to Linux and confuses a lot of people. Hadoop and the Hadoop
ecosystem combine many components with different configuration endpoints and varied authorization methods
(POSIX file-based, SQL database-like), and this can present a big challenge in developing and maintaining security
authorization policy. The community has projects to address these issues (e.g., Apache Sentry and Argus), but as of
this writing no comprehensive solution exists.
Difficult to Integrate with Enterprise Security
Most organizations use an enterprise security solution to achieve a variety of objectives. Sometimes it is to
mitigate the risk of cyberattacks, for security compliance, or for simply establishing customer trust. Hadoop, however,
can’t integrate with any of these security solutions. It may be possible to write a custom plug-in to accommodate
Hadoop; but it may not be possible to have Hadoop comply with all the security policies.
Unencrypted Data in Transit
Hadoop is a distributed system and hence consists of several nodes (such as NameNode and a number of DataNodes)
with data communication between them. That means data is transmitted over the network, but it is not encrypted.
This may be sensitive financial data such as account information or personal data (such as a Social Security number),
and it is open to attacks.
Internode communication in Hadoop uses protocols such as RPC, TCP/IP, and HTTP. Currently, only RPC
communication can be encrypted easily (that’s communication between NameNode, JobTracker, DataNodes, and
Hadoop clients), leaving the actual read/write of file data between clients and DataNodes (TCP/IP) and HTTP
communication (web consoles, communication between NameNode/Secondary NameNode and MapReduce shuffle
data) open to attacks.
It is possible to encrypt TCP/IP or HTTP communication, but that requires the use of Kerberos or SASL (Simple
Authentication and Security Layer) frameworks. Also, Hadoop's built-in encryption has a very negative impact on
performance and is not widely used.
No Data Encryption at Rest
At rest, data is stored on disk. Hadoop doesn’t encrypt data that’s stored on disk and that can expose sensitive data to
malevolent attacks. Currently, no codec or framework is provided for this purpose. This is especially a big issue due to
the nature of Hadoop architecture, which spreads data across a large number of nodes, exposing the data blocks at all
those unsecured entry points.
There are a number of choices for implementing encryption at rest with Hadoop; but they are offered by different
vendors and rely on their distributions for implementing encryption. Most notable was the Intel Hadoop distribution
that provided encryption for data stored on disk and used Apache as well as custom codecs for encrypting data. Some
of that functionality is proposed to be available through Project Rhino (an Apache open source project).
You have to understand that since Hadoop usually deals with large volumes of data and encryption/decryption
takes time, it is important that the framework used performs the encryption/decryption fast enough, so that it doesn’t
impact performance. The Intel distribution claimed to perform these operations with great speed—provided Intel
CPUs were used along with Intel disk drives and all the other related hardware.
Hadoop Doesn’t Track Data Provenance
There are situations where a multistep MapReduce job fails at an intermediate step, and since the execution is often
batch oriented, it is very difficult to debug the failure because the output data set is all that’s available.
Data provenance is a process that captures how data is processed through the workflow and aids debugging by
enabling backward tracing—finding the input data that resulted in output for any given step. If the output is unusual
(or not what was expected), backward tracing can be used to determine the input that was processed.
Hadoop doesn’t provide any facilities for data provenance (or backward tracing); you need to use a third-party
tool such as RAMP if you require data provenance. That makes troubleshooting job failures really hard and time
consuming.
This concludes our discussion of Hadoop architecture and the related security issues. We will discuss the
Hadoop Stack next.
The Hadoop Stack
Hadoop core modules and main components are referred to as the Hadoop Stack. Together, the Hadoop core
modules provide the basic working functionality for a Hadoop cluster. The Hadoop Common module provides the
shared libraries, and HDFS offers the distributed storage and functionality of a fault-tolerant file system. MapReduce
or YARN provides the distributed data processing functionality. So, without all the bells and whistles, that’s a
functional Hadoop cluster. You can configure a node to be the NameNode and add a couple of DataNodes for a basic,
functioning Hadoop cluster.
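As a minimal sketch of what that looks like in practice (the hostname and port here are placeholders, and the property name differs between Hadoop versions), each node's core-site.xml simply points at the NameNode:

<!-- core-site.xml (illustrative values; Hadoop 1.x uses fs.default.name instead) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>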
Here’s a brief introduction to each of the core modules:
Hadoop Common: These are the common libraries or utilities that support functioning
of other Hadoop modules. Since the other modules use these libraries heavily, this is the
backbone of Hadoop and is absolutely necessary for its working.
Hadoop Distributed File System (HDFS): HDFS is at the heart of a Hadoop cluster. It is a
distributed file system that is fault tolerant, easily scalable, and provides high throughput
using local processing and local data storage at the data nodes. (I have already discussed
HDFS in great detail in the “HDFS” section).
Hadoop YARN: YARN is a framework for job scheduling and cluster resource management.
It uses a global resource manager process to effectively manage data processing resources for
a Hadoop cluster in conjunction with Node Manager on each data node.
The resource manager also has a pluggable scheduler (any scheduler can be used such as
the FIFO or Fair scheduler) that can schedule jobs and works with the Application Master
Process on DataNodes. It uses MapReduce as a distributed data processing algorithm by
default, but can also use any other distributed processing application as required.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
MapReduce is the algorithm that takes "processing to data." All the data nodes can run
map tasks (transformations of input to desired output) and reduce tasks (sorting and merging of output)
locally, independently, and in parallel, to provide the high throughput that's required for very
large datasets. I have discussed MapReduce in detail earlier in the "Hadoop's Job Framework
using MapReduce" section.
So, you now know what the Hadoop core modules are, but how do they relate to each other to form a cohesive
system with the expected functionality? Figure 2-9 illustrates the interconnections.
Data Access & Retrieval
Job Processing
Hadoop
client
HDFS
(Hadoop
Distributed
File System)
Other Apps
Communication while
executing client requests
MapReduce
Hadoop YARN
Hadoop Common
(Common libraries or routines)
Operating system
Job
processing is
handled by
YARN using
MapReduce or
any other
distributed
processing
application
Data Processing is handled by HDFS and services
involved are NameNode and DataNodes
Hadoop Common
libraries are
backbone of all
the Hadoop
services and
provide support
for common
functionality
used by other
services
Figure 2-9. Hadoop core modules and their interrelations
As you can see, the two major aspects of Hadoop are distributed storage and distributed data processing. You
can also see clearly the dependency of both these aspects on Hadoop Common libraries and the operating system.
Hadoop is like any other application that runs in the context of the operating system. But then what happens to the
security? Is it inherited from the operating system? Well, that’s where the problem is. Security is not inherited from the
operating system and Hadoop’s security, while improving, is still immature and difficult to configure. You therefore
have to find ways to authenticate, authorize, and encrypt data within your Hadoop cluster. You will learn about those
techniques in Chapters 4, 5, 8, and 9.
Lastly, please note that in the real world, it is very common to have NameNode (which manages HDFS
processing) and JobTracker (which manages job processing) running on the same node. So, Figure 2-9 only indicates
a logical division of processing; it may not necessarily be true in case of physical implementation.
Main Hadoop Components
As you saw in the last section, Hadoop core modules provide basic Hadoop cluster functionality, but the main
components are not limited to core modules. After all, a basic Hadoop cluster can’t be used as a production
environment. Additional functionality such as ETL and bulk-load capability from other (non-Hadoop) data sources,
scheduling, fast key-based retrieval, and query capability (for data) are required for any data storage and management
system. Hadoop’s main components provide these missing capabilities as well.
For example, the Pig component provides a data flow language useful for designing ETL. Sqoop provides a way
to transfer data between HDFS and relational databases. Hive provides query capability with an SQL-like language.
Oozie provides scheduling functionality, and HBase adds columnar storage for massive data storage and fast
key-based retrieval. Table 2-1 lists some popular components along with their usage.
Table 2-1. Popular Hadoop Components
Component | Description | Notes
HBase | HBase is an open source, distributed, versioned, column-oriented data store. | It can be used to store large volumes of structured and unstructured data. It provides key-based access to data and hence can retrieve data very quickly. It is highly scalable and uses HDFS for data storage. The real strengths of HBase are its ability to store unstructured schema-less data and retrieve it really fast using the row keys.
Hive | Hive provides a SQL-like query language (HiveQL) that can be used to query HDFS data. | Hive converts the queries to MapReduce jobs, runs them, and displays the results. Hive "tables" are actually files within HDFS. Hive is suited for data warehouse use, as it doesn't support row-level inserts, updates, or deletes. Over 95% of Facebook's Hadoop jobs are now driven by a Hive front end.
Pig | Pig is a data flow language that can effectively be used as an ETL system for warehousing environments. | Like actual pigs, which eat almost anything, the Pig programming language is designed to handle any kind of data (hence the name). Using Pig, you can load HDFS data you want to manipulate, run the data through a set of transformations (which, behind the scenes, are translated into MapReduce tasks), and display the results on screen or write them to a file.
Sqoop | Sqoop provides connectivity with relational databases (Microsoft SQL Server, Oracle, MySQL, etc.), data warehouses, as well as NoSQL databases (Cassandra, HBase, MongoDB, etc.). | It is easy to transfer data between HDFS (or Hive/HBase tables) and any of these data sources using Sqoop "connectors." Sqoop integrates with Oozie to schedule data transfer tasks. Sqoop's first version was a command-line client; Sqoop2 has a GUI front end and a server that can be used with multiple Sqoop clients.
Oozie | Oozie is a workflow scheduler, meaning it runs jobs based on workflow. In this context, a workflow is a collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). | Control dependency between actions simply defines the sequence of actions; for example, the second action can't start until the first action is completed. DAG refers to a loopless graph that has a starting point and an end point and proceeds in one direction without ever reversing. To summarize, Oozie simply executes actions or jobs (considering the dependencies) in a predefined sequence. A following step in the sequence is not started unless Oozie receives a completion response from the remote system executing the current step or job. Oozie is commonly used to schedule Pig or Sqoop workflows and integrates well with them.
Flume | Flume is a distributed system for moving large amounts of data from multiple sources (while transforming or aggregating it as needed) to a centralized destination or a data store. | Flume has sources, decorators, and sinks. Sources are data sources such as log files, output of processes, traffic at a TCP/IP port, etc., and Flume has many predefined sources for ease of use. Decorators are operations on the source stream (e.g., compressing or uncompressing data, adding or removing certain characters from the data stream, grouping and averaging numeric data, etc.). Sinks are targets such as text files, console displays, or HDFS files. A popular use of Flume is to move diagnostic or job log files to a central location and analyze them using keywords (e.g., "error" or "failure").
Mahout | Mahout is a machine learning tool. | Remember how Amazon or Netflix recommends products when you visit their sites, based on your browsing history or prior purchases? That's Mahout or a similar machine-learning tool in action, coming up with the recommendations using collaborative filtering, one of the machine-learning techniques Mahout uses to generate recommendations based on a user's clicks, ratings, or past purchases. Mahout uses several other techniques to "learn" or make sense of data, and it provides excellent means to develop machine-learning or data-mining libraries that are highly scalable (i.e., they can still be used if data volumes change astronomically).
You might have observed that no component is dedicated to providing security. You will need to use open source
products, such as Kerberos and Sentry, to supplement this functionality. You’ll learn more about these in
Chapters 4 and 5.
It is important to have this brief introduction to the main components, as I am assuming usage of an “extended”
Hadoop cluster (core modules and main components) throughout the book while discussing security implementation
as well as use of monitoring (Chapter 7), logging (Chapter 6), or encryption (Chapters 8 and 9).
Summary
This chapter introduced Hadoop’s architecture, core modules, main components, and inherent security issues.
Hadoop is not a perfectly secure system; but then, what is? How does Hadoop compare with a model
secure system, and what modifications will you need to make to Hadoop in order to secure it? Chapter 1 briefly outlined
a model secure system (SQL Server), and I will discuss how to secure Hadoop in Chapters 4 to 8 using various
techniques.
In later chapters, you will also learn how a Hadoop cluster uses the Hadoop Stack (Hadoop core modules and
main components together) presented here. Understanding the workings of the Hadoop Stack will also make it easier
for you to understand the solutions I am proposing to supplement security. The next chapter provides an overview of
the solutions I will discuss throughout the book. Chapter 3 will also help you decide which specific solutions you want
to focus on and direct you to the chapter where you can find the details you need.
CHAPTER 3
Introducing Hadoop Security
We live in a very insecure world. Starting with the key to your home’s front door to those all-important virtual keys,
your passwords, everything needs to be secured—and well. In the world of Big Data, where humongous amounts of
data are processed, transformed, and stored, it's all the more important to secure your data.
A few years back, the London Police arrested a group of young people for fraud and theft of digital assets worth
$30 million. Their 20-year-old leader used Zeus Trojan, software designed to steal banking information, from his
laptop to commit the crime. Incidents like these are commonplace because of the large amount of information
and myriad systems involved even while conducting simple business transactions. In the past, there were probably
only thousands who could potentially access your data to commit a crime against you; now, with the advent of the
Internet, there are potentially billions! Likewise, before Big Data existed, only direct access to specific data on specific
systems was a danger; now, Big Data multiplies the places such information is stored and hence provides more ways
to compromise your privacy or worse. Everything in the new technology-driven, Internet-powered world has been
scaled up and scaled out—crime and the potential for crime included.
Imagine if your company spent a couple of million dollars installing a Hadoop cluster to gather and analyze your
customers’ spending habits for a product category using a Big Data solution. Because that solution was not secure,
your competitor got access to that data and your sales dropped 20% for that product category. How did the system
allow unauthorized access to data? Wasn’t there any authentication mechanism in place? Why were there no alerts?
This scenario should make you think about the importance of security, especially where sensitive data is involved.
Although Hadoop does have inherent security concerns due to its distributed architecture (as you saw in
Chapter 2), the situation described is extremely unlikely to occur on a Hadoop installation that’s managed securely.
A Hadoop installation that has clearly defined user roles and multiple levels of authentication (and encryption) for
sensitive data will not let any unauthorized access go through.
This chapter serves as a roadmap for the rest of the book. It provides a brief overview of each of the techniques
you need to implement to secure your Hadoop installation; later chapters will then cover the topics in more detail.
The purpose is to provide a quick overview of the security options and also help you locate relevant techniques
quickly as needed. I start with authentication (using Kerberos), move on to authorization (using Hadoop ACLs and
Apache Sentry), and then discuss secure administration (audit logging and monitoring). Last, the chapter examines
encryption for Hadoop and available options. I have used open source software wherever possible, so you can easily
build your own Hadoop cluster to try out some of the techniques described in this book.
As the foundation for all that, however, you need to understand the way Hadoop was developed and also a
little about the Hadoop architecture. Armed with this background information, you will better understand the
authentication and authorization techniques discussed later in the chapter.
Starting with Hadoop Security
When talking about Hadoop security, you have to consider how Hadoop was conceptualized. When Doug Cutting
and Mike Cafarella started developing Hadoop, security was not exactly the priority. I am certain it was not even
considered as part of the initial design. Hadoop was meant to process large amounts of web data in the public
domain, and hence security was not the focus of development. That’s why it lacked a security model and only
provided basic authentication for HDFS—which was not very useful, since it was extremely easy to impersonate
another user.
Another issue is that Hadoop was not designed and developed as a cohesive system with predefined modules,
but was rather developed as a collage of modules that either correspond to various open source projects or a set
of (proprietary) extensions developed by various vendors to supplement functionality lacking within the Hadoop
ecosystem.
Therefore, Hadoop assumes the isolation of (or a cocoon of) a trusted environment for its cluster to operate
without any security violations—and that’s lacking most of the time. Right now, Hadoop is transitioning from an
experimental or emerging technology stage to enterprise-level and corporate use. These new users need a way to
secure sensitive business data.
Currently, the standard community-supported way of securing a Hadoop cluster is to use Kerberos security.
Hadoop and its major components now fully support Kerberos authentication. That merely adds a level of
authentication, though. With just Kerberos added there is still no consistent built-in way to define user roles for finer
control across components, no way to secure access to Hadoop processes (or daemons) or encrypt data in transit
(or even at rest). A secure system needs to address all these issues and also offer more features and ways to customize
security for specific needs. Throughout the book, you will learn how to use these techniques with Hadoop. For now,
let’s start with a brief look at a popular solution to address Hadoop’s authentication issue.
Introducing Authentication and Authorization for HDFS
The first and most important consideration for security is authentication. A user needs to be authenticated before he
is allowed to access the Hadoop cluster. Since Hadoop doesn't perform any secure authentication of its own, Kerberos
is often used with Hadoop to provide authentication.
When Kerberos is implemented for security, a client (who is trying to access the Hadoop cluster) contacts the KDC
(the central Kerberos server that hosts the credential database) and requests access. If the provided credentials are
valid, the KDC grants the requested access. We can divide the Kerberos authentication process into three main steps:
TGT generation, where the Authentication Server (AS) grants the client a Ticket Granting
Ticket (TGT) as an authentication token. A client can use the same TGT for multiple TGS
requests (until the TGT expires).
TGS session ticket generation, where the client uses its credentials to decrypt the TGT and
then uses the TGT to get a service ticket from the Ticket Granting Server (TGS); that service
ticket grants access to the Hadoop cluster.
Service access, where the client uses the service ticket to authenticate and access the
Hadoop cluster.
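As a rough sketch of what these steps look like from a client's shell (the realm EXAMPLE.COM and the principal alice are hypothetical), the TGT is obtained explicitly and the service tickets are then acquired transparently:

# Step 1: request a TGT from the Authentication Server (prompts for the password)
kinit alice@EXAMPLE.COM

# Inspect the cached tickets; the TGT appears as krbtgt/EXAMPLE.COM@EXAMPLE.COM
klist

# Steps 2 and 3 happen behind the scenes: the Hadoop client uses the cached TGT
# to obtain a service ticket from the TGS and presents it to the cluster
hadoop fs -ls /user/alice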
Chapter 4 discusses the details of Kerberos architecture and also how Kerberos can be configured to be used
with Hadoop. In addition, you’ll find a step-by-step tutorial that will help you in setting up Kerberos to provide
authentication for your Hadoop cluster.
Authorization
When implementing security, your next step is authorization. Specifically, how can you implement fine-grained
authorization and roles in Hadoop? The biggest issue is that all information is stored in files, just like on a Linux host
(HDFS is, after all, a file system). There is no concept of a table (as in relational databases), and that makes it harder to
authorize a user for partial access to the stored data.
Whether you call it defining details of authorization, designing fine-grained authorization, or “fine tuning”
security, it’s a multistep process. The steps are:
Analyze your environment,
Classify data for access,
Determine who needs access to what data,
Determine the level of necessary access, and
Implement your designed security model.
However, you have to remember that Hadoop (and its distributed file system) stores all its data in files, and hence
there are limitations to the granularity of security you can design. Like Unix or Linux, Hadoop has a permissions
model very similar to the POSIX-based (portable operating system interface) model—and it’s easy to confuse those
permissions for Linux permissions—so the permission granularity is limited to read or write permissions to files or
directories. You might say, “What’s the problem? My Oracle or PostgreSQL database stores data on disk in files, why is
Hadoop different?” Well, with the traditional database security model, all access is managed through clearly defined
roles and channeled through a central server process. In contrast, with data files stored within HDFS, there is no such
central process and multiple services like Hive or HBase can directly access HDFS files.
To give you a detailed understanding of the authorization possible using file/directory-based permissions,
Chapter 5 discusses the concepts, explains the logical process, and also provides a detailed real-world example. For
now, another real-world example, this one of authorization, will help you understand the concept better.
Real-World Example for Designing Hadoop Authorization
Suppose you are designing security for an insurance company’s claims management system, and you have to assign
roles and design fine-grained access for all the departments accessing this data. For this example, consider the
functional requirements of two departments: the call center and claims adjustors.
Call center representatives answer calls from customers and then file or record claims if they satisfy all the
stipulated conditions (e.g. damages resulting from “acts of God” do not qualify for claims and hence a claim can’t be
filed for them).
A claims adjustor looks at the filed claims and rejects those that violate any regulatory conditions. That adjustor
then submits the rest of the claims for investigation, assigning them to specialist adjustors. These adjustors evaluate
the claims based on company regulations and their specific functional knowledge to decide the final outcome.
Automated reporting programs pick up claims tagged with a final status “adjusted” and generate appropriate
letters to be mailed to the customers, informing them of the claim outcome. Figure 3-1 summarizes the system.
Figure 3-1. Claims data and access needed by various departments (call center representatives need Append permissions on the claims data; claims adjustors need Read and Append permissions)
As you can see, call center representatives will need to append claim data and adjustors will need to modify data.
Since HDFS doesn’t have a provision for updates or deletes, adjustors will simply need to append a new record or row
(for a claim and its data) with updated data and a new version number. A scheduled process will need to generate
a report to look for adjusted claims and mail the final claim outcome to the customers. That process, therefore, will
need read access to the claims data.
In Hadoop, remember, data is stored in files. For this example, data is stored in a file called Claims. Daily data is
stored temporarily in a file called Claims_today and appended to the Claims file on a nightly basis. The call center folks
use the group ccenter, while the claims adjustors use the group claims, meaning the HDFS permissions on Claims
and Claims_today look like those shown in Figure 3-2.
--w--w---- 1 ccuser ccenter 1024 2013-11-17 10:50 Claims_today
-rw-rw---- 1 cluser claims 102400 2013-11-16 22:00 Claims

Figure 3-2. HDFS file permissions (the first character of each permission string indicates whether the entry is a directory; the next three groups of "rwx" are read, write, and execute permissions for the owner, the group, and others; the remaining columns show the file owner, the group the owner belongs to, the file size, and the file creation or last modification date and time; only call center reps can write to the temporary claims data file, and only claims adjustors have read/write permissions to the Claims data file)
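These permissions could be established with commands like the following sketch (the /claims directory path is assumed purely for illustration):

# Owner ccuser, group ccenter; write-only for owner and group (mode 220)
hadoop fs -chown ccuser:ccenter /claims/Claims_today
hadoop fs -chmod 220 /claims/Claims_today

# Owner cluser, group claims; read/write for owner and group (mode 660)
hadoop fs -chown cluser:claims /claims/Claims
hadoop fs -chmod 660 /claims/Claims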
The first file, Claims_today, has write permissions for its owner (ccuser) and the group ccenter. So, all the representatives
belonging to this group can write or append to this file.
The second file, Claims, has read and write permissions for its owner (cluser) and the group claims. So, all the claims
adjustors can read the Claims data and append new rows for the claims they have completed their work on, and for
which they are providing a final outcome. Also, notice that you will need to create a user named Reports within the
group claims for accessing the data for reporting.
Note The permissions discussed in the example are HDFS permissions and not the operating system permissions.
Hadoop follows a separate permissions model that appears to be the same as Linux, but the preceding permissions exist
within HDFS—not Linux.
So, do these permissions satisfy all the functional needs for this system? You can verify easily that they do. Of
course the user Reports has write permissions that he doesn’t need; but other than that, all functional requirements
are satisfied.
We will discuss this topic with a more detailed example in Chapter 5. As you have observed, the permissions you
assigned were limited to complete data files. However, in the real world, you may need your permissions granular
enough to access only parts of data files. How do you achieve that? The next section previews how.
Fine-Grained Authorization for Hadoop
Sometimes the necessary permissions for data don’t match the existing group structure for an organization. For
example, a bank may need a backup supervisor to have the same set of permissions as a supervisor, just in case the
supervisor is on vacation or out sick. Because the backup supervisor might only need a subset of the supervisor’s
permissions, it is not practical to design a new group for him or her. Also, consider another situation where
corporate accounts are being moved to a different department, and the group that’s responsible for migration
needs temporary access.
New versions of HDFS support ACL (Access Control List) functionality, and this will be very useful in such
situations. With ACLs you can specify read/write permissions for specific users or groups as needed. In the bank
example, if the backup supervisor needs write permission to a specific “personal accounts” file, then the HDFS ACL
feature can be used to provide the necessary write permission without making any other changes to file permissions.
For the migration scenario, the group that’s performing migration can be assigned read/write permissions using
HDFS ACL. If you are familiar with POSIX ACLs, HDFS ACLs work exactly the same way. Chapter 5 discusses
Hadoop ACLs in detail in the "Access Control Lists for HDFS" section.
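As a preview, the HDFS ACL commands for the bank scenario might look like the following sketch (the user name bsupervisor and the file path are hypothetical):

# Grant the backup supervisor read/write access without touching the base permissions
hdfs dfs -setfacl -m user:bsupervisor:rw- /bank/personal_accounts

# Verify the resulting ACL entries
hdfs dfs -getfacl /bank/personal_accounts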
Last, how do you configure permissions for only part of a data file or a certain subset of the data? Maybe a user needs to
have access to nonsensitive information only. The only way to configure further granularity (for authorization) is
by using a data abstraction layer such as Hive along with specialized software such as Apache Sentry. You can define parts of file
data as tables within Hive and then use Sentry to configure permissions. Sentry works with users and groups of users
(called groups) and lets you define rules (possible actions on tables, such as read or write) and roles (groups of rules).
A user or group can have one or multiple roles assigned to them. Chapter 5 provides a real-world example using Hive
and Sentry that explains how fine-tuned authorization can be defined for Hadoop. "Role-Based Authorization with
Apache Sentry" in Chapter 5 also has architectural details for Apache Sentry.
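To give you a flavor of how this fits together, a Sentry policy-file entry might look like the following sketch (the group, role, database, and table names are all hypothetical), mapping a group to a role and the role to a read-only rule on a Hive table:

# sentry-provider.ini (illustrative)
[groups]
claims = claims_read_role

[roles]
claims_read_role = server=server1->db=claimsdb->table=claims->action=select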
Securely Administering HDFS
Chapters 4 and 5 will walk you through various techniques of authentication and authorization, which help secure
your system but are not a total solution. What if authorized users access resources that they are not authorized to
use, or unauthorized users access resources on a Hadoop cluster using unforeseen methods (read: hacking)? Secure
administration helps you deal with these scenarios by monitoring or auditing all access to your cluster. If you can’t
stop this type of access, you at least need to know it occurred! Hadoop offers extensive logging for all its processes
(also called daemons), and several open source tools can assist in monitoring a cluster. (Chapters 6 and 7 discuss
audit logging and monitoring in detail.)
Securely administering HDFS presents a number of challenges, due to the design of HDFS and the way it
is structured. Monitoring can help with security by alerting you of unauthorized access to any Hadoop cluster
resources. You then can design countermeasures for malicious attacks based on the severity of these alerts. Although
Hadoop provides metrics for this monitoring, they are cumbersome to use. Monitoring is much easier when you use
specialized software such as Nagios or Ganglia. Also, standard Hadoop distributions by Cloudera and Hortonworks
provide their own monitoring modules. Last, you can capture and monitor MapReduce counters.
Audit logs supplement the security by recording all access that flows through to your Hadoop cluster. You can
decide the level of logging (such as only errors, or errors and warnings, etc.), and advanced log management provided
by modules like Log4j provides a lot of control and flexibility for the logging process. Chapter 6 provides a detailed
overview (with an example) of the audit logging available with Hadoop. As a preview, the next section offers a brief
overview of Hadoop logging.
Using Hadoop Logging for Security
When a security issue occurs, having extensive activity logs available can help you investigate the problem. Before a
breach occurs, therefore, you should enable audit logging to track all access to your system. You can always filter out
information that’s not needed. Even if you have enabled authentication and authorization, auditing cluster activity
still has benefits. After all, even authorized users may perform tasks they are not authorized to do; for example, a user
with update permissions could update an entry without appropriate approvals. You have to remember, however, that
Hadoop logs are raw output. So, to make them useful to a security administrator, tools to ingest and process these
logs are required (note that some installations use Hadoop itself to analyze the audit logs, so you can use Hadoop to
protect Hadoop!).
Just capturing the auditing data is not enough. You need to capture Hadoop daemon data as well. Businesses
subject to federal oversight laws like the Health Insurance Portability and Accountability Act (HIPAA) and the
Sarbanes-Oxley Act (SOX) are examples of this need. For example, US law requires that all businesses covered
by HIPAA prevent unauthorized access to “Protected Health Information” (patients’ names, addresses, and all
information pertaining to the patients’ health and payment records) or applications that audit it. Businesses that must
comply with SOX (a 2002 US federal law that requires the top management of any US public company to individually
certify the accuracy of their firm’s financial information), must audit all access to any data object (e.g., table) within
an application. They also must monitor who submitted, managed, or viewed a job that can change any data within an
audited application. For business cases like these, you need to capture:
HDFS audit logs (to record all HDFS access activity within Hadoop),
MapReduce audit logs (record all submitted job activity), and
Hadoop daemon log files for NameNode, DataNode, JobTracker and TaskTracker.
The Log4j API is at the heart of Hadoop logging, be it audit logs or Hadoop daemon logs. The Log4j module
provides extensive logging capabilities and defines several logging levels that you can use to limit the number of
messages that are output, as well as to suppress messages below a chosen severity. For example, if the Log4j logging
level is defined as INFO for NameNode logging, then an event will be written to the NameNode log for any file access
request that the NameNode receives (i.e., all the informational messages will be written to the NameNode log file).
You can easily change the logging level for a Hadoop daemon at its URL. For example,
http://jobtracker-host:50030/logLevel will change the logging level while this daemon is running, but it will be
reset when it is restarted. If you encounter a problem, you can temporarily change the logging level for the appropriate
daemon to facilitate debugging. When the problem is resolved, you can reset the logging level. For a permanent
change to log level for a daemon, you need to change the corresponding property in the Log4j configuration file
(log4j.properties).
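For example, assuming a standard log4j.properties layout (host names here are placeholders), the temporary and permanent approaches might look like this:

# Temporary change, effective only until the daemon restarts
hadoop daemonlog -setlevel jobtracker-host:50030 org.apache.hadoop.mapred.JobTracker DEBUG

# Permanent change in log4j.properties: enable HDFS audit logging at INFO level
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO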
The Log4j architecture uses a logger (a named channel for log events such as NameNode, JobTracker, etc.), an
Appender (to which a log event is forwarded and which is responsible for writing it to console or a file), and a layout
(a formatter for log events). The logging levels—FATAL, ERROR, WARN, INFO, DEBUG, and TRACE—indicate the
severity of events in descending order. The minimum log level is used as a filter; log events with a log level greater
than or equal to that which is specified are accepted, while less severe events are simply discarded.
Figure 3-3 demonstrates how level filtering works. The columns show the logging levels, while the rows show
the level associated with the appropriate Logger Configuration. The intersection identifies whether the Event would
be allowed to pass for further processing (YES) or discarded (NO). Using Figure 3-3 you can easily determine what
category of events will be included in the logs, depending on the logging level configured. For example, if logging level
for NameNode is set at INFO, then all the messages belonging to the categories INFO, WARN, ERROR and FATAL will
be written to the NameNode log file. You can easily identify this, looking at the column INFO and observing the event
levels that are marked as YES. The levels TRACE and DEBUG are marked as NO and will be filtered out. If logging level
for JobTracker is set to FATAL, then only FATAL errors will be logged, as is obvious from the values in column FATAL.
Event Level | TRACE | DEBUG | INFO | WARN | ERROR | FATAL
TRACE | YES | NO | NO | NO | NO | NO |
DEBUG | YES | YES | NO | NO | NO | NO |
INFO | YES | YES | YES | NO | NO | NO |
WARN | YES | YES | YES | YES | NO | NO |
ERROR | YES | YES | YES | YES | YES | NO |
FATAL | YES | YES | YES | YES | YES | YES |
Figure 3-3. Log4j logging levels and inclusions based on event levels (rows show the possible logging levels of events generated by Hadoop daemons or services such as NameNode or JobTracker; columns show the logging level configured for that service; YES indicates that messages of that event level will actually appear in the log file for that configured level)
Chapter 6 will cover Hadoop logging (as well as its use in investigating security issues) comprehensively.
You’ll get to know the main features of monitoring in the next section.
Monitoring for Security
When you think of monitoring, you probably think about possible performance issues that need troubleshooting or,
perhaps, alerts that can be generated if a system resource (such as CPU, memory, disk space) hits a threshold value.
You can, however, use monitoring for security purposes as well. For example, you can generate alerts if a user tries to
access cluster metadata or reads/writes a file that contains sensitive data, or if a job tries to access data it shouldn’t.
More importantly, you can monitor a number of metrics to gain useful security information.
It is more challenging to monitor a distributed system like Hadoop because the monitoring software has to
monitor individual hosts and then consolidate that data in the context of the whole system. For example, CPU
consumption on a DataNode is not as important as the CPU consumption on the NameNode. So, how will the system
process CPU consumption alerts, or be capable of identifying separate threshold levels for hosts with different roles
within the distributed system? Chapter 7 answers these questions in detail, but for now let’s have a look at the Hadoop
metrics that you can use for security purposes:
Activity statistics on the NameNode
Activity statistics for a DataNode
Detailed RPC information for a service
Health monitoring for sudden change in system resources
Tools of the Trade
The leading monitoring tools are Ganglia (http://ganglia.sourceforge.net) and Nagios (www.nagios.org). These
popular open source tools complement each other, and each has different strengths. Ganglia focuses on gathering
metrics and tracking them over a time period, while Nagios focuses more on being an alerting mechanism. Because
gathering metrics and alerting both are essential aspects of monitoring, they work best in conjunction. Both Ganglia
and Nagios have agents running on all hosts for a cluster and gather information.
Ganglia
Conceptualized at the University of California, Berkeley, Ganglia is an open source monitoring project meant to be
used with large distributed systems. Each host that’s part of the cluster runs a daemon process called gmond that
collects and sends the metrics (like CPU usage, memory usage, etc.) from the operating system to a central host. After
receiving all the metrics, the central host can display, aggregate, or summarize them for further use.
Ganglia is designed to integrate easily with other applications and gather statistics about their operations. For
example, Ganglia can easily receive output data from Hadoop metrics and use it effectively. Gmond (which Ganglia
has running on every host) has a very small footprint and hence can easily be run on every machine in the cluster
without affecting user performance.
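For instance, Hadoop's metrics subsystem can be pointed at Ganglia with a few lines in hadoop-metrics2.properties; this is a sketch, with the gmond host and port as placeholders:

# hadoop-metrics2.properties: send NameNode metrics to Ganglia 3.1+
namenode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
namenode.sink.ganglia.servers=gmond-host:8649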
Ganglia’s web interface (Figure 3-4) shows you the hardware used for the cluster, cluster load in the last hour,
CPU and memory resource consumption, and so on. You can have a look at the summary usage for last hour, day,
week, or month as you need. Also, you can get details of any of these resource usages as necessary. Chapter 7 will
discuss Ganglia in greater detail.
Figure 3-4. Ganglia monitoring system: Cluster overview
Nagios
Nagios provides a very good alerting mechanism and can use metrics gathered by Ganglia. Earlier versions of Nagios
polled information from its target hosts but currently it uses plug-ins that run agents on hosts (that are part of the
cluster). Nagios has an excellent built-in notification system and can be used to deliver alerts via pages or e-mails for
certain events (e.g., NameNode failure or disk full). Nagios can monitor applications, services, servers, and network
infrastructure. Figure 3-5 shows the Nagios web interface, which can easily manage status (of monitored resources),
alerts (defined on resources), notifications, history, and so forth.
Figure 3-5. Nagios web interface for monitoring (a host status summary is shown for the hosts selected for monitoring, along with host monitoring details and summary status; the main menu lets you choose the high-level task you have in mind)
The real strength of Nagios is the hundreds of user-developed plug-ins that are freely available to use. Plug-ins
are available in all categories. For example, the System Metrics category contains the subcategory Users, which
contains plug-ins such as Show Users that can alert you when certain users either log in or don't. Using these plug-ins
can cut down valuable customization time, which is a major issue for all open source (and non-open source)
software. Chapter 7 discusses the details of setting up Nagios.
Encryption: Relevance and Implementation for Hadoop
Being a distributed system, Hadoop has data spread across a large number of hosts and stored locally. There is a
large amount of data communication between these hosts; hence data is vulnerable in transit as well as when at rest
and stored on local storage. Hadoop started as a data store for collecting web usage data as well as other forms of
nonsensitive large-volume data. That’s why Hadoop doesn’t have any built-in provision for encrypting data.
Today, the situation is changing and Hadoop is increasingly being used to store sensitive warehoused data in the
corporate world. This has created a need for the data to be encrypted in transit and at rest. Now there are a number of
alternatives available to help you encrypt your data.
Encryption for Data in Transit
Internode communication in Hadoop uses protocols such as RPC, TCP/IP, and HTTP. RPC communication can
be encrypted using a simple Hadoop configuration option and is used for communication between NameNode,
JobTracker, DataNodes, and Hadoop clients. That leaves the actual read/write of file data between clients and
DataNodes (TCP/IP) and HTTP communication (web consoles, communication between NameNode/Secondary
NameNode, and MapReduce shuffle data) unencrypted.
It is possible to encrypt TCP/IP or HTTP communication, but that requires use of Kerberos or SASL frameworks.
The current version of Hadoop allows network encryption (in conjunction with Kerberos) by setting explicit values
in the configuration files core-site.xml and hdfs-site.xml. Chapter 4 will revisit this detailed setup and discuss
network encryption at length.
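As a preview, the relevant settings look roughly like this sketch (exact property names can vary by Hadoop version):

<!-- core-site.xml: encrypt RPC traffic ("privacy" = authentication, integrity, and encryption) -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the block data transferred between clients and DataNodes -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>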
Encryption for Data at Rest
There are a number of choices for implementing encryption at rest with Hadoop, but they are offered by different
vendors and rely on their distributions to implement encryption. Most notable are the Intel Project Rhino (committed
to the Apache Software Foundation and open source) and AWS (Amazon Web Services) offerings, which provide
encryption for data stored on disk.
Because Hadoop usually deals with large volumes of data and encryption/decryption takes time, it is important
that the framework used performs the encryption/decryption fast enough that it doesn’t impact performance. The
Intel solution (shortly to be offered through the Cloudera distribution) claims to perform these operations with great
speed—provided that Intel CPUs are used along with Intel disk drives and all the other related hardware. Let’s have a
quick look at some details of Amazon’s encryption “at rest” option.
AWS encrypts data stored within HDFS and also supports encrypted data manipulation by other components
such as Hive or HBase. This encryption can be transparent to users (if the necessary passwords are stored in
configuration files) or can prompt them for passwords before allowing access to sensitive data, can be applied on a
file-by-file basis, and can work in combination with external key management applications. This encryption can use
symmetric as well as asymmetric keys. To use this encryption, sensitive files must be encrypted using a symmetric or
asymmetric key before they are stored in HDFS.
When an encrypted file is stored within HDFS, it remains encrypted. It is decrypted as needed for processing
and re-encrypted before it is moved back into storage. The results of the analysis are also encrypted, including
intermediate results. Data and results are neither stored nor transmitted in unencrypted form. Figure 3-6 provides
an overview of the process. Data stored in HDFS is encrypted using symmetric keys, while MapReduce jobs use
symmetric keys (with certificates) for transferring encrypted data.
Figure 3-6. Details of at-rest encryption provided by Intel's Hadoop distribution (now Project Rhino): data requested by a client application is decrypted before delivery, and data written back by the client application is encrypted before being written to HDFS, where it remains stored in encrypted form
Chapter 8 will cover encryption in greater detail. It provides an overview of encryption concepts and protocols
and then briefly discusses two options for implementing encryption: using Intel’s distribution (now available as
Project Rhino) and using AWS to provide transparent encryption.
Summary
With a roadmap in hand, finding where you want to go and planning how to get there is much easier. This chapter
has been your roadmap to techniques for designing and implementing security for Hadoop. After an overview of
Hadoop architecture, you investigated authentication using Kerberos to provide secure access. You then learned how
authorization is used to specify the level of access, and that you need to follow a multistep process of analyzing data
and needs to define an effective authorization strategy.
To supplement your security through authentication and authorization, you need to monitor for unauthorized
access or unforeseen malicious attacks continuously; tools like Ganglia or Nagios can help. You also learned the
importance of logging all access to Hadoop daemons using the Log4j logging system and Hadoop daemon logs as well
as audit logs.
Last, you learned about encryption of data in transit (as well as at rest) and why it is important as an additional
level of security: it is the only way to stop unauthorized access by hackers who have bypassed the authentication
and authorization layers. To implement encryption for Hadoop, you can use solutions from AWS (Amazon Web
Services) or Intel's Project Rhino.
For the remainder of the book, you’ll follow this roadmap, digging deeper into each of the topics presented in this
chapter. We’ll start in Chapter 4 with authentication.
PART II
Authenticating and Authorizing
Within Your Hadoop Cluster
CHAPTER 4
Open Source Authentication
in Hadoop
In previous chapters, you learned what a secure system is and what Hadoop security is missing in comparison to
what the industry considers a secure system—Microsoft SQL Server (a relational database system). This chapter
will focus on implementing some of the features of a secure system to secure your Hadoop cluster from all the Big
Bad Wolves out there. Fine-tuning security is more art than science. There are no rules as to what is "just right"
for an environment, but you can rely on some basic conventions to help you get closer—if not “just right.” For
example, because Hadoop is a distributed system and is mostly accessed using client software on a Windows PC, it
makes sense to start by securing the client. Next, you can think about securing the Hadoop cluster by adding strong
authentication, and so on.
Before you can measure success, however, you need a yardstick. In this case, you need a vision of the ideal
Hadoop security setup. You’ll find the details in the next section.
Pieces of the Security Puzzle
Figure 4-1 diagrams an example of an extensive security setup for Hadoop. It starts with a secure client. The SSH protocol
secures the client using key pairs; the server uses a public key, and the client uses a private key. This is to counter
spoofing (intercepting and redirecting a connection to an attacker’s system) and also a hacked or compromised
password. You’ll delve deeper into the details of secure client setup in the upcoming “Establishing Secure Client
Access” section. Before the Hadoop system allows access, it authenticates a client using Kerberos (an open-source
application used for authentication). You’ll learn how to set up Kerberos and make it work with Hadoop in the section
“Building Secure User Authentication.”
Once a user is connected, the focus is on limiting permissions as per the user’s role. The user in Figure 4-1
has access to all user data except sensitive salary data. You can easily implement this by splitting the data into
multiple files and assigning appropriate permissions to them. Chapter 5 focuses on these authorization issues
and more.
Figure 4-1. Ideal Hadoop security, with all the required pieces in place (a secure client authenticates to Hadoop using Kerberos; inter-process communication between the NameNode and DataNodes is secured; and Hadoop authorizes access to user data, shown in the example as a table of names, locations, and salaries, while excluding the sensitive salary data)
You will also observe that inter-process communication between various Hadoop processes (e.g., between
NameNode and DataNodes) is secure, which is essential for a distributed computing environment. Such an
environment involves a lot of communication between various hosts, and unsecured data is open to various
types of malicious attacks. The final section of this chapter explores how to secure or encrypt the inter-process
traffic in Hadoop.
These are the main pieces of the Hadoop security puzzle. One piece that’s missing is encryption for data at rest,
but you’ll learn more about that in Chapter 8.
Establishing Secure Client Access
Access to a Hadoop cluster starts at the client you use, so start by securing the client. Unsecured data is open
to malicious attacks that can result in data being destroyed or stolen for unlawful use. This danger is greater for
distributed systems (such as Hadoop) that have data blocks spread over a large number of nodes. A client is like a
gateway to the actual data. You need to secure the gate before you can think about securing the house.
OpenSSH or SSH protocol is commonly used to secure a client by using a login/password or keys for access.
Keys are preferable because a password can be compromised, hacked, or spoofed. For both Windows-based and
Linux-based clients, PuTTY (www.chiark.greenend.org.uk/~sgtatham/putty) is an excellent open-source client that
supports the SSH protocol. Besides being free, a major advantage to PuTTY is its ability to allow access using keys and
a passphrase instead of password (more on the benefits of this coming up). Assistance in countering spoofing is a less
obvious, yet equally important additional benefit of PuTTY that deserves your attention.
Countering Spoofing with PuTTY's Host Keys
Spoofing, as you remember, is a technique used to extract your personal information (such as a password) for possible
misuse, by redirecting your connection to the attacker’s computer (instead of the one you think you are connected to),
so that you send your password to the attacker’s machine. Using this technique, attackers get access to your password,
log in, and use your account for their own malicious purposes.
To counter spoofing, a unique code (called a host key) is allocated to each server. The way these keys are created,
it’s not possible for a server to forge another server’s key. So if you connect to a server and it sends you a different host
key (compared to what you were expecting), SSH (or a secure client like PuTTY that is using SSH) can warn you that
you are connected to a different server—which could mean a spoofing attack is in progress!
PuTTY stores the host key (for servers you successfully connect to) via entries in the Windows Registry. Then, the
next time you connect to a server to which you previously connected, PuTTY compares the host key presented by the
server with the one stored in the registry from the last time. If it does not match, you will see a warning and then have
a chance to abandon your connection before you provide a password or any other private information.
However, when you connect to a server for the first time, PuTTY has no way of checking if the host key is the right
one or not. So it issues a warning that asks whether you want to trust this host key or not:
The server's host key is not cached in the registry. You
have no guarantee that the server is the computer you
think it is.
The server's rsa2 key fingerprint is:
ssh-rsa 1024 5c:d4:6f:b7:f8:e9:57:32:3d:a3:3f:cf:6b:47:2c:2a
If you trust this host, hit Yes to add the key to
PuTTY's cache and carry on connecting.
If you want to carry on connecting just once, without
adding the key to the cache, hit No.
If you do not trust this host, hit Cancel to abandon the
connection.
If the host is not known to you or you have any doubts about whether the host is the one you want to connect to,
you can cancel the connection and avoid being a victim of spoofing.
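The same check can be performed out of band with OpenSSH tools; for example (the host name is a placeholder), you can fetch the server's host key through a trusted channel and compare its fingerprint with what your client reports:

# Fetch the server's public host key, then print its fingerprint
ssh-keyscan -t rsa namenode-host > hostkey.pub
ssh-keygen -lf hostkey.pub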
Key-Based Authentication Using PuTTY
Suppose a super hacker gets into your network and gains access to the communication from your client to the server
you wish to connect to. Suppose also that this hacker captures the host authentication string that the real host sends
to your client and returns it as his own to get you to connect to his server instead of the real one. Now he can easily get
your password and can use that to access sensitive data.
How can you stop such an attack? The answer is to use key-based authentication instead of a password. Without
your private key, the hacker can't generate a valid signature and won't be able to gain access!
One way to implement keys for authentication is to use SSH, a protocol for communicating securely over a
public, unsecured network. The security of communication relies on a key pair
used for encryption and decryption of data. SSH can be used (or implemented) in several ways. You can automatically
generate a public/private key pair to encrypt a network connection and then use password authentication to log on.
Another way to use SSH is to generate a public/private key pair manually to perform the authentication, which will
allow users or programs to log in without specifying a password.
For Windows-based clients, you can generate the key pair using PuTTYgen, which is open source and freely
available. Key pairs consist of a public key, which is copied to the server, and a private key, which is located on the
secure client.
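For Linux-based clients (covered in detail in Appendix B), an equivalent key pair can be generated with OpenSSH's own tools; a minimal sketch (the server name is illustrative, not part of the example cluster):

# generate a 2048-bit RSA key pair; you are prompted for a passphrase
ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa
# append the public key to ~/.ssh/authorized_keys on the server
ssh-copy-id user@server.example.com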
The private key can be used to generate a new signature. A signature generated with a private key cannot be
forged by anyone who does not have that key. However, someone who has the corresponding public key can check
if a particular signature is genuine.
When using a key pair for authentication, PuTTY generates a signature using your private key (specified
using a key file). The server checks whether the signature is genuine (using your public key) and allows you to log in. If
your client is being spoofed, all that the attacker intercepts is a signature that can't be reused; your private key and
password are not compromised. Figure 4-2 illustrates the authentication process.
Figure 4-2. Key-based authentication using PuTTY. The server running OpenSSH holds the public key; the secure client holds the private key (which can be further secured using a passphrase). PuTTY generates a signature using the private key and sends it to the server; the server authenticates using the public key and allows login (if successful).
To set up key-based authentication using PuTTY, you must first select the type of key you want. For the example,
I’ll use RSA and set up a key pair that you can use with a Hadoop cluster. To set up a key pair, open the PuTTY Key
Generator (PuTTYgen.exe). At the bottom of the window, select the parameters before generating the keys. For
example, to generate an RSA key for use with the SSH-2 protocol, select SSH-2 RSA under Type of key to generate.
The value for Number of bits in a generated key determines the size or strength of the key. For this example, 1024 is
sufficient, but in a real-world scenario, you might need a longer key such as 2048 for better security. One important
thing to remember is that a longer key is more secure, but the encryption/decryption processing time increases with
the key length. Enter a key passphrase (to encrypt your private key for protection) and make a note of it since you will
need to use it later for decryption.
Note The most common public-key algorithms available for use with PuTTY are RSA and DSA. PuTTY developers
strongly recommend you use RSA; DSA (also known as DSS, the United States’ federal Digital Signature Standard) has an
intrinsic weakness that enables easy creation of a signature containing enough information to give away the private key.
(To better understand why RSA is almost impossible to break, see Chapter 8.)
Next, click the Generate button. In response, PuTTYgen asks you to move the mouse around to generate
randomness (that’s the PuTTYgen developers having fun with us!). Move the mouse in circles over the blank area in
the Key window; the progress bar will gradually fill as PuTTYgen collects enough randomness and keys are generated
as shown in Figure 4-3.
Figure 4-3. Generating a key pair for implementing secure client
Once the keys are generated, click the Save public key and Save private key buttons to save the keys.
Next, you need to copy the public key to the file authorized_keys located in the .ssh directory under your
home directory on the server you are trying to connect to. For that purpose, refer to the section Public key for
pasting into OpenSSH authorized_keys file in Figure 4-3. Move your cursor to that section and copy all the text
(as shown). Then, open a PuTTY session and connect using your login and password. Change to the .ssh directory
and open the authorized_keys file using the editor of your choice. Paste the text of the public key that you created with
PuTTYgen into the file, and save the file (Figure 4-4).
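If you prefer the shell to an editor, the same edit can be scripted on the server; a minimal sketch (run from your home directory, with the public-key text from PuTTYgen ready to paste):

mkdir -p ~/.ssh && chmod 700 ~/.ssh
# paste the single-line public key from PuTTYgen, press Enter, then Ctrl-D
cat >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys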
Figure 4-4. Pasting the public key in authorized_keys file
Using Passphrases
What happens if someone gets access to your computer? They can generate signatures just as you would. Then,
they can easily connect to your Hadoop cluster using your credentials! This can of course be easily avoided by
using a passphrase of your choice to encrypt your private key before storing it on your local machine. Then, to
generate a signature, PuTTY needs to decrypt the key, which requires your passphrase, thereby preventing any
unauthorized access.
Now, the need to type a passphrase whenever you log in can be inconvenient. So, PuTTY provides Pageant,
an authentication agent that stores decrypted private keys and uses them to generate signatures as requested. All
you need to do is start Pageant and enter your private key along with your passphrase. You can then invoke PuTTY
any number of times; Pageant will generate the signatures automatically. This arrangement works until you restart
your Windows client. Another nice feature of Pageant is that it never stores your decrypted private key on your
local disk when it shuts down.
So, as a last step, configure your PuTTY client to use the private key file instead of a password for authentication
(Figure 4-5). Click the + next to the SSH option to open the drill-down, and then click the Auth (authentication)
option under it. Browse to and select the private key file you saved earlier (generated through PuTTYgen). Click Open to
open a new session.
Figure 4-5. Configuration options for private key authentication with PuTTY (a private key file is used instead of a password for authentication)
Now you are ready to be authenticated by the server using login and passphrase, as shown in Figure 4-6. Enter the
login name at the login prompt (root in this case) and enter the passphrase to connect!
Figure 4-6. Secure authentication using login and a passphrase
In some situations (e.g., scheduled batch processing), it will be impossible to type the passphrase; at those
times, you can start Pageant and load your private key into it by typing your passphrase once. Please refer to
Appendix A for an example of Pageant use and implementation and Appendix B for PuTTY implementation for
Linux-based clients.
Building Secure User Authentication
A secure client connection is vital, but that’s only a good starting point. You need to secure your Hadoop cluster when
this secure client connects to it. The user security process starts with authenticating a user. Although Hadoop itself has
no means of authenticating a user, currently all the major Hadoop distributions are available with Kerberos installed,
and Kerberos provides authentication.
With earlier versions of Hadoop, when a user tried to access a Hadoop cluster, Hadoop simply checked the ACL
to ensure that the underlying OS user was allowed access, and then provided this access. This was not a very secure
option, nor did it limit access for a user (since a user could easily impersonate the Hadoop superuser). The user then
had access to all the data within a Hadoop cluster and could modify or delete it if desired. Therefore, you need to
configure Kerberos or another similar application to authenticate a user before allowing access to data—and then, of
course, limit that access, too!
Kerberos is one of the most popular options used with Hadoop for authentication. Developed at MIT, Kerberos
has been around since the 1980s and has been enhanced multiple times. The current version, Kerberos version 5,
was designed in 1993 and is freely available as an open source download. Kerberos is most commonly used for
securing Hadoop clusters and providing secure user authentication. In this section you'll learn how Kerberos
works, what its main components are, and how to install it. After installation, I will discuss a simple Kerberos
implementation for Hadoop.
Kerberos Overview
Kerberos is an authentication protocol for "trusted hosts on untrusted networks." This simply means that Kerberos
assumes all the hosts it communicates with can be trusted: that there is no spoofing involved and that the
secret keys it uses are not compromised. To use Kerberos more effectively, consider a few other key facts:
Kerberos continuously depends on a central server. If the central server is unavailable, no
one can log in. It is possible to use multiple “central” servers (to reduce the risk) or additional
authentication mechanisms (as fallback).
Kerberos is heavily time dependent, and thus the clocks of all the governed hosts must be
synchronized within configured limits (5 minutes by default). Most of the time, Network Time
Protocol (NTP) daemons are used to keep the clocks of the governed hosts synchronized (see the sketch after this list).
Kerberos offers a single sign-on approach. A client needs to provide a password only once per
session and then can transparently access all authorized services.
Passwords should not be saved on clients or any intermediate application servers. Kerberos
stores them centrally without any redundancy.
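As a concrete example of the time-synchronization requirement, on a RHEL or CentOS host the clocks can be kept within limits using NTP; a minimal sketch (the pool server name is illustrative):

# one-time synchronization against a public NTP server
ntpdate pool.ntp.org
# keep the clock synchronized from now on
/sbin/service ntpd start
/sbin/chkconfig ntpd on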
Figure 4-7 provides an overview of Kerberos authentication architecture. As shown, the Authentication Server
and Ticket Granting Server are major components of the Kerberos key distribution center.
Figure 4-7. Kerberos key distribution center (KDC), with the Authentication Server (AS), the Ticket Granting Server (TGS), and the internal Kerberos database as its main components (TGT = Ticket Granting Ticket). The flow: (1) the client requests authentication from the AS; (2) the AS responds with a TGT; (3) the client uses the TGT to request a service ticket; (4) the TGS provides a service ticket for authentication; (5) the client accesses the "Kerberized" (secure) service.
A client requests access to a Kerberos-enabled service using Kerberos client libraries. The Kerberos client
contacts the Key Distribution Center, or KDC (the central Kerberos server that hosts the credential database),
and requests access. If the provided credentials are valid, the KDC grants the requested access. The KDC uses an internal
database for storing credentials, along with two main components: the Authentication Server (AS) and the Ticket
Granting Server (TGS).
Authentication
The Kerberos authentication process contains three main steps (a client-side sketch follows the list):
The AS grants the user (and host) a Ticket Granting Ticket (TGT) as an authentication
token. A TGT is valid for a specific time only (validity is configured by the administrator
through the configuration file). For service principals (logins used to run services or
background processes) requesting a TGT, credentials are supplied to the AS through special
files called keytabs.
The client uses its credentials to decrypt the TGT and then uses the TGT to get a service ticket
from the Ticket Granting Server to access a "Kerberized" service. A client can use the same
TGT for multiple TGS requests (until the TGT expires).
The user (and host) uses the service ticket to authenticate to and access a specific Kerberos-
enabled service.
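From the command line, the first two steps look like this; a minimal sketch, assuming a user principal alex@EXAMPLE.COM exists:

kinit alex@EXAMPLE.COM    # request a TGT from the AS (prompts for the password)
klist                     # the cached TGT appears as krbtgt/EXAMPLE.COM@EXAMPLE.COM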
Important Terms
To fully understand Kerberos, you need to speak its language of realms, principals, tickets, and databases. For the
example Kerberos implementation, you will implement Kerberos on a single-node cluster called pract_hdp_sec,
using a virtual domain or realm called EXAMPLE.COM.
The term realm indicates an administrative domain (similar to a Windows domain) used for authentication. Its
purpose is to establish the virtual boundary for use by an AS to authenticate a user, host, or service. This does not
mean that the authentication between a user and a service forces them to be in the same realm! If the two objects
belong to different realms but have a trust relationship between them, then the authentication can still proceed
(called cross-authentication). For our implementation, I have created a single realm called EXAMPLE.COM (note that by
convention a realm typically uses capital letters).
A principal is a user, host, or service associated with a realm and stored as an entry in the AS database typically
located on KDC. A principal in Kerberos 5 is defined using the following format: Name[/Instance]@REALM. Common
usage for users is username@REALM or username/role@REALM (e.g., alex/admin@REALM and alex@REALM are two different
principals that might be defined). For service principals, the common format is service/hostname@REALM (e.g.,
hdfs/host1.myco.com). Note that Hadoop expects a specific format for its service principals. For our implementation,
I have defined principals such as hdfs/pract_hdp_sec@EXAMPLE.COM (hdfs for the NameNode and DataNode),
mapred/pract_hdp_sec@EXAMPLE.COM (mapred for the JobTracker and TaskTracker), and so on.
A ticket is a token generated by the AS when a client requests authentication. Information in a ticket includes:
the requesting user’s principal (generally the username), the principal of the service it is intended for, the client’s
IP address, the validity date and time (in timestamp format), the ticket's maximum lifetime, and the session key (which has a
fundamental role). Each ticket expires, generally after 24 hours, though this is configurable for a given Kerberos
installation.
In addition, tickets may be renewed by user request until a configurable time period from issuance (e.g., 7 days
from issue). Users either explicitly use the Kerberos client to obtain a ticket or are provided one automatically if the
system administrator has configured the login client (e.g., SSH) to obtain the ticket automatically on login. Services
typically use a keytab file (a protected file containing the service's password) to run background threads
that obtain and renew the TGT for the service as needed. All Hadoop services need a keytab file placed on their
respective hosts, with the location of this file defined in the service's site XML.
Kerberos uses an encrypted database to store all the principal entries associated with users and services. Each
entry contains the following information: principal name, encryption key, maximum validity for a ticket associated
with a principal, maximum renewal time for a ticket associated with a principal, password expiration date, and
expiration date of the principal (after which no tickets will be issued).
There are further details associated with Kerberos architecture, but because this chapter focuses on installing and
configuring Kerberos for Hadoop, basic understanding of Kerberos architecture will suffice for our purposes. So let’s
start with Kerberos installation.
Installing and Configuring Kerberos
The first step for installing Kerberos is to install all the Kerberos services for your new KDC. For Red Hat Enterprise
Linux (RHEL) or CentOS operating systems, use this command:
yum install krb5-server krb5-libs krb5-auth-dialog krb5-workstation
When the server is installed, you must edit the two main configuration files, located by default in the following
directories (if not, use Linux utility “find” to locate them):
/etc/krb5.conf
/var/kerberos/krb5kdc/kdc.conf
The next phase is to specify your realm (EXAMPLE.COM for the example) and to change the KDC value to the
name of the fully qualified Kerberos server host (here, pract_hdp_sec). You must also copy the updated version of
/etc/krb5.conf to every node in your cluster. Here is /etc/krb5.conf for our example:
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log

[libdefaults]
default_realm = EXAMPLE.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

[kdc]
profile = /var/kerberos/krb5kdc/kdc.conf

[realms]
EXAMPLE.COM = {
  kdc = pract_hdp_sec
  admin_server = pract_hdp_sec
}

[domain_realm]
.example.com = EXAMPLE.COM
example.com = EXAMPLE.COM
Please observe the changed values for the realm name and KDC name. The example tickets will be valid for
up to 24 hours after creation, so ticket_lifetime is set to 24h. Those tickets can be renewed for up to 7 days after
issuance, because renew_lifetime is set to 7d. Following is the /var/kerberos/krb5kdc/kdc.conf I am using:
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88

[realms]
EXAMPLE.COM = {
  profile = /etc/krb5.conf
  supported_enctypes = aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
  allow-null-ticket-addresses = true
  database_name = /var/kerberos/krb5kdc/principal
  # master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  dict_file = /usr/share/dict/words
  max_life = 2d 0h 0m 0s
  max_renewable_life = 7d 0h 0m 0s
  admin_database_lockfile = /var/kerberos/krb5kdc/kadm5_adb.lock
  key_stash_file = /var/kerberos/krb5kdc/.k5stash
  kdc_ports = 88
  kadmind_port = 749
  default_principal_flags = +renewable
}
Included in the settings for realm EXAMPLE.COM, the acl_file parameter specifies the ACL (the file
/var/kerberos/krb5kdc/kadm5.acl in RHEL or CentOS) used to define the principals that have admin (modifying)
access to the Kerberos database. The file can be as simple as a single entry:
*/admin@EXAMPLE.COM *
This entry specifies that all principals with the /admin instance extension have full access to the database. The Kerberos
service kadmin needs to be restarted for the change to take effect, as shown below.
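On RHEL or CentOS, the restart looks like this (other operating systems have equivalent service commands):

/sbin/service kadmin restart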
Also, observe that the max_life (maximum ticket life) setting is 2d (2 days) for the realm EXAMPLE.COM. You can
override configuration settings for specific realms. You can also specify these values for a principal.
Note in the [realms] section of the preceding code that I have disabled 256-bit encryption. If you want to use
256-bit encryption, you must download the Java Cryptography Extension (JCE) and follow the instructions to install
it on any node running Java processes using Kerberos (for Hadoop, all cluster nodes). If you want to skip this and just
use 128-bit encryption, remove the line #master_key_type = aes256-cts and remove the references to aes-256
before the generation of your KDC master key, as described in the section “Creating a Database.”
This concludes installing and setting up Kerberos. Please note that it’s not possible to cover all the possible
options (operating systems, versions, etc.) and nuances of Kerberos installation in a single section. For a more
extensive discussion of Kerberos installation, please refer to MIT’s Kerberos installation guide at
http://web.mit.edu/kerberos/krb5-1.6/krb5-1.6/doc/krb5-install.html. O’Reilly’s Kerberos: The Definitive Guide
is also a good reference.
Getting back to Kerberos implementation, let me create a database and set up principals (for use with Hadoop).
Preparing for Kerberos Implementation
Kerberos uses an internal database (stored as a file) to save details of principals that are set up for use. This database
contains users (principals) and their private keys. Principals include internal users that Kerberos uses as well as those
you define. The database file is stored at the location defined in the configuration file kdc.conf; for this example,
/var/kerberos/krb5kdc/principal.
Creating a Database
To set up a database, use the utility kdb5_util:
kdb5_util create -r EXAMPLE.COM -s
You will see a response like:
Loading random data
Initializing database '/var/kerberos/krb5kdc/principal' for realm 'EXAMPLE.COM',
master key name 'K/M@EXAMPLE.COM'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify:
Please make a note of the master key. Also, note that the -s option allows you to save the master server key
for the database in a stash file (defined using the parameter key_stash_file in kdc.conf). If the stash file doesn't exist,
you need to log into the KDC with the master password (specified during database creation) each time it starts, so
that the master server key can be regenerated.
Now that the database is created, create the first user principal. This must be done on the KDC server itself, while
you are logged in as root:
/usr/sbin/kadmin.local -q "addprinc root/admin"
You will be prompted for a password. Please make a note of the password for principal root/admin@EXAMPLE.COM.
You can create other principals later; now, it's time to start Kerberos. To do so for RHEL or CentOS operating systems,
issue the following commands to start the Kerberos services (for other operating systems, please refer to the appropriate
command reference):
/sbin/service kadmin start
/sbin/service krb5kdc start
Creating Service Principals
Next, I will create service principals for use with Hadoop using the kadmin utility. Principal name hdfs will be used for
HDFS; mapred will be used for MapReduce, HTTP for HTTP, and yarn for YARN-related services (in this code, kadmin:
is the prompt; commands are in bold):
[root@pract_hdp_sec]# kadmin
Authenticating as principal root/admin@EXAMPLE.COM with password.
Password for root/admin@EXAMPLE.COM:
kadmin: addprinc -randkey hdfs/pract_hdp_sec@EXAMPLE.COM
Principal "hdfs/pract_hdp_sec@EXAMPLE.COM" created.
kadmin: addprinc -randkey mapred/pract_hdp_sec@EXAMPLE.COM
Principal "mapred/pract_hdp_sec@EXAMPLE.COM" created.
kadmin: addprinc -randkey HTTP/pract_hdp_sec@EXAMPLE.COM
Principal "HTTP/pract_hdp_sec@EXAMPLE.COM" created.
kadmin: addprinc -randkey yarn/pract_hdp_sec@EXAMPLE.COM
Principal "yarn/pract_hdp_sec@EXAMPLE.COM" created.
kadmin:
Creating Keytab Files
Keytab files are used for authenticating services non-interactively. Because you may schedule services to run
remotely or at a specific time, you need to save the authentication information in a file so that it can be validated
against the Kerberos internal database; keytab files serve this purpose.
Getting back to file creation, extract the related keytab file (using kadmin) and place it in the keytab directory
(/etc/security/keytabs) of the respective components (kadmin: is the prompt; commands are in bold):
[root@pract_hdp_sec]# kadmin
Authenticating as principal root/admin@EXAMPLE.COM with password.
Password for root/admin@EXAMPLE.COM:
kadmin: xst -k mapred.keytab hdfs/pract_hdp_sec@EXAMPLE.COM HTTP/pract_hdp_sec@EXAMPLE.COM
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:mapred.keytab.
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type des3-cbc-sha1 added to keytab WRFILE:mapred.keytab.
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type arcfour-hmac added to keytab WRFILE:mapred.keytab.
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type des-hmac-sha1 added to keytab WRFILE:mapred.keytab.
Entry for principal hdfs/pract_hdp_sec@EXAMPLE.COM with kvno 5, encryption type des-cbc-md5 added to keytab WRFILE:mapred.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:mapred.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type des3-cbc-sha1 added to keytab WRFILE:mapred.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type arcfour-hmac added to keytab WRFILE:mapred.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type des-hmac-sha1 added to keytab WRFILE:mapred.keytab.
Entry for principal HTTP/pract_hdp_sec@EXAMPLE.COM with kvno 4, encryption type des-cbc-md5 added to keytab WRFILE:mapred.keytab.
Please observe that key entries for all supported encryption types (defined in the configuration file kdc.conf as the
parameter supported_enctypes) are added to the keytab file for the principals.
Getting back to keytab creation, create keytab files for the other principals (at the kadmin prompt) as follows:
kadmin: xst -k hdfs.keytab hdfs/pract_hdp_sec@EXAMPLE.COM HTTP/pract_hdp_sec@EXAMPLE.COM
kadmin: xst -k yarn.keytab yarn/pract_hdp_sec@EXAMPLE.COM HTTP/pract_hdp_sec@EXAMPLE.COM
You can verify that the correct keytab files and principals are associated with the correct service using the klist
command. For example, on the NameNode:
[root@pract_hdp_sec]# klist -kt mapred.keytab
Keytab name: FILE:mapred.keytab
KVNO Timestamp         Principal
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   5 10/18/14 12:42:21 hdfs/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
   4 10/18/14 12:42:21 HTTP/pract_hdp_sec@EXAMPLE.COM
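You can also confirm that a keytab authenticates end to end by using it to obtain a ticket; a quick check with the keytab and principal above:

kinit -kt mapred.keytab hdfs/pract_hdp_sec@EXAMPLE.COM
klist    # should now show a TGT for hdfs/pract_hdp_sec@EXAMPLE.COM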
So far, you have defined principals and extracted keytab files for HDFS, MapReduce, and YARN-related principals
only. You will need to follow the same process and define principals for any other component services running on
your Hadoop cluster, such as Hive, HBase, Oozie, and so on. Note that the principals for web communication must be
named HTTP, because web-based protocol implementations of Kerberos (SPNEGO) require this exact naming.
For deploying the keytab files to slave nodes, please copy (or move if newly created) the keytab files to the
/etc/hadoop/conf folder. You need to secure the keytab files (only the owner should be able to read them). So, you need to change
the owner to the service username accessing the keytab (e.g., if the HDFS process runs as user hdfs, then user hdfs
should own the keytab file) and set the file permission to 400. Please remember, the service principals for hdfs, mapred,
and HTTP have an FQDN (fully qualified domain name) associated with the username. Also, service principals are host
specific and unique for each node.
[root@pract_hdp_sec]# sudo mv hdfs.keytab mapred.keytab /etc/hadoop/conf/
[root@pract_hdp_sec]# sudo chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
[root@pract_hdp_sec]# sudo chown mapred:hadoop /etc/hadoop/conf/mapred.keytab
[root@pract_hdp_sec]# sudo chmod 400 /etc/hadoop/conf/hdfs.keytab
[root@pract_hdp_sec]# sudo chmod 400 /etc/hadoop/conf/mapred.keytab
Implementing Kerberos for Hadoop
So far, I have installed and configured Kerberos and also created the database, principals, and keytab files. So, what’s
the next step for using this authentication for Hadoop? Well, I need to add the Kerberos setup information to relevant
Hadoop configuration files and also map the Kerberos principals set up earlier to operating systems users (since
operating system users will be used to actually run the Hadoop services). I will also need to assume that a Hadoop
cluster in a non-secured mode is configured and available. To summarize, configuring Hadoop for Kerberos will be
achieved in two stages:
Mapping service principals to their OS usernames
Adding information to various Hadoop configuration files
Mapping Service Principals to Their OS Usernames
Rules are used to map service principals to their respective OS usernames. These rules are specified in the Hadoop
configuration file core-site.xml as the value for the optional key hadoop.security.auth_to_local.
The default rule is simply named DEFAULT. It translates all principals in your default domain to their first
component. For example, hdfs@EXAMPLE.COM and hdfs/admin@EXAMPLE.COM both become hdfs, assuming your
default domain or realm is EXAMPLE.COM. So if the service principal and the OS username are the same, the default rule
is sufficient. If the two names are not identical, you have to create rules to do the mapping.
Each rule is divided into three parts: base, filter, and substitution. The base begins by specifying the number of
components in the principal name (excluding the realm), followed by a colon, and the pattern for building the username
from the sections of the principal name. In the pattern section $0 translates to the realm, $1 translates to the first
component, and $2 to the second component. So, for example, [2:$1] translates hdfs/admin@EXAMPLE.COM to hdfs.
The filter consists of a regular expression in parentheses that must match the generated string for the rule to
apply. For example, (.*@EXAMPLE.COM) matches any string that ends in @EXAMPLE.COM.
The substitution is a sed (popular Linux stream editor) rule that translates a regular expression into a fixed string.
For example: s/@[A-Z]*\.COM// removes the first instance of @ followed by an uppercase alphabetic name,
followed by .COM.
In my case, I am using the OS user hdfs to run the NameNode and DataNode services. So, if I had created
Kerberos principals nn/pract_hdp_sec@EXAMPLE.COM and dn/pract_hdp_sec@EXAMPLE.COM for use with Hadoop, then
I would need to map these principals to the OS user hdfs. The rule for this purpose would be:
RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/
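In core-site.xml, such rules are listed (one rule per line) as the value of hadoop.security.auth_to_local, usually ending with DEFAULT as the fallback; a minimal sketch using the rule just derived:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/
    DEFAULT
  </value>
</property>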
Adding Information to Various Hadoop Configuration Files
To enable Kerberos to work with HDFS, you need to modify two configuration files:
core-site.xml
hdfs-site.xml
Table 4-1 shows modifications to properties within core-site.xml. Please remember to propagate these
changes to all the hosts in your cluster.
Table 4-1. Modifications to Properties in Hadoop Configuration File core-site.xml
hadoop.security.authentication (value: kerberos)
    Sets the authentication type for the cluster. Valid values are simple (the default) or kerberos.
hadoop.security.authorization (value: true)
    Enables authorization for the different protocols.
hadoop.security.auth_to_local (value: [2:$1] DEFAULT)
    The mapping from Kerberos principal names to local OS usernames, using the mapping rules.
hadoop.rpc.protection (value: privacy)
    Possible values are authentication, integrity, and privacy. authentication = mutual client/server authentication; integrity = authentication plus a guarantee of the integrity of data exchanged between client and server; privacy = authentication, integrity, and confidentiality (data exchanged between client and server is encrypted).
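As they would appear in core-site.xml, the entries of Table 4-1 look like this:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>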
The hdfs-site.xml configuration file specifies the keytab locations as well as principal names for various HDFS
daemons. Please remember, hdfs and http principals are specific to a particular node.
A Hadoop cluster may contain a large number of DataNodes, and it may be virtually impossible to configure the
principals manually for each of them. Therefore, Hadoop provides a _HOST variable that resolves to a fully qualified
domain name at runtime. This variable allows the site XML to remain consistent throughout the cluster. However, please
note that the _HOST variable can't be used with all Hadoop configuration files. For example, neither the jaas.conf file used by
ZooKeeper (which provides resource synchronization across cluster nodes and can be used by applications to ensure
that tasks across the cluster are serialized or synchronized) nor Hive supports the _HOST variable. Table 4-2
shows modifications to properties within hdfs-site.xml, some of which use the _HOST variable. Please remember to
propagate these changes to all the hosts in your cluster.
Table 4-2. Modified Properties for Hadoop Configuration File hdfs-site.xml

dfs.block.access.token.enable (value: true)
    If true, access tokens are used for accessing DataNodes.
dfs.namenode.kerberos.principal (value: hdfs/_HOST@EXAMPLE.COM)
    Kerberos principal name for the NameNode.
dfs.secondary.namenode.kerberos.principal (value: hdfs/_HOST@EXAMPLE.COM)
    Kerberos principal name for the secondary NameNode.
*dfs.secondary.https.port (value: 50490)
    The https port to which the secondary NameNode binds.
dfs.web.authentication.kerberos.principal (value: HTTP/_HOST@EXAMPLE.COM)
    The HTTP Kerberos principal used by Hadoop.
dfs.namenode.kerberos.internal.spnego.principal (value: HTTP/_HOST@EXAMPLE.COM)
    The HTTP (SPNEGO) principal for the NameNode's HTTP service.
dfs.secondary.namenode.kerberos.internal.spnego.principal (value: HTTP/_HOST@EXAMPLE.COM)
    The HTTP (SPNEGO) principal for the secondary NameNode's HTTP service.
*dfs.secondary.http.address (value: 192.168.142.135:50090)
    IP address of your secondary NameNode host, with port 50090.
dfs.web.authentication.kerberos.keytab (value: /etc/hadoop/conf/spnego.service.keytab)
    Kerberos keytab file with credentials for the HTTP principal.
dfs.datanode.kerberos.principal (value: hdfs/_HOST@EXAMPLE.COM)
    The Kerberos principal that runs the DataNode.
dfs.namenode.keytab.file (value: /etc/hadoop/conf/hdfs.keytab)
    Keytab file containing the NameNode service and host principals.
dfs.secondary.namenode.keytab.file (value: /etc/hadoop/conf/hdfs.keytab)
    Keytab file containing the secondary NameNode service and host principals.
dfs.datanode.keytab.file (value: /etc/hadoop/conf/hdfs.keytab)
    Keytab file for the DataNode.
*dfs.https.port (value: 50470)
    The https port to which the NameNode binds.
*dfs.https.address (value: 192.168.142.135:50470)
    The https address for the NameNode (IP address of the host + port 50470).
dfs.datanode.address (value: 0.0.0.0:1019)
    The DataNode server address and port for data transfer.
dfs.datanode.http.address (value: 0.0.0.0:1022)
    The DataNode http server address and port.
*These values may change for your cluster.
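For example, the NameNode-related entries of Table 4-2 translate into hdfs-site.xml as follows (a representative excerpt, not the complete file):

<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>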
The files core-site.xml and hdfs-site.xml are included as downloads for your reference. They also contain
Kerberos-related properties set up for other components such as Hive, Oozie, and HBase.
MapReduce-Related Configurations
For MapReduce (version 1), the mapred-site.xml file needs to be configured to work with Kerberos. It needs to
specify the keytab file locations as well as principal names for the JobTracker and TaskTracker daemons. Use Table 4-3
as a guide, and remember that mapred principals are specific to a particular node.
Table 4-3. mapred Principals
mapreduce.jobtracker.kerberos.principal (value: mapred/_HOST@EXAMPLE.COM)
    mapred principal used to start the JobTracker daemon.
mapreduce.jobtracker.keytab.file (value: /etc/hadoop/conf/mapred.keytab)
    Location of the keytab file for the mapred user.
mapreduce.tasktracker.kerberos.principal (value: mapred/_HOST@EXAMPLE.COM)
    mapred principal used to start the TaskTracker daemon.
mapreduce.tasktracker.keytab.file (value: /etc/hadoop/conf/mapred.keytab)
    Location of the keytab file for the mapred user.
mapred.task.tracker.task-controller (value: org.apache.hadoop.mapred.LinuxTaskController)
    TaskController class used to launch the child JVM.
mapreduce.tasktracker.group (value: mapred)
    Group for running the TaskTracker.
mapreduce.jobhistory.keytab (value: /etc/hadoop/conf/mapred.keytab)
    Location of the keytab file for the mapred user.
mapreduce.jobhistory.principal (value: mapred/_HOST@EXAMPLE.COM)
    mapred principal used to start the JobHistory daemon.
For YARN, the yarn-site.xml file needs to be configured for specifying the keytab and principal details;
Table 4-4 holds the details.
Table 4-4. YARN Principals
yarn.resourcemanager.principal (value: yarn/_HOST@EXAMPLE.COM)
    yarn principal used to start the ResourceManager daemon.
yarn.resourcemanager.keytab (value: /etc/hadoop/conf/yarn.keytab)
    Location of the keytab file for the yarn user.
yarn.nodemanager.principal (value: yarn/_HOST@EXAMPLE.COM)
    yarn principal used to start the NodeManager daemon.
yarn.nodemanager.keytab (value: /etc/hadoop/conf/yarn.keytab)
    Location of the keytab file for the yarn user.
yarn.nodemanager.container-executor.class (value: org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor)
    Executor class for launching applications in YARN.
yarn.nodemanager.linux-container-executor.group (value: yarn)
    Group for executing Linux containers.
For MapReduce (version 1), the TaskController class defines how Map and Reduce tasks are launched and controlled,
and it uses a configuration file called task-controller.cfg. This configuration file is present in the
Hadoop configuration folder (/etc/hadoop/conf/) and should have the configurations listed in Table 4-5.
Table 4-5. TaskController Configurations
hadoop.log.dir (value: /var/log/hadoop-0.20-mapreduce)
    Hadoop log directory (will vary per your Hadoop distribution). This location is used to make sure that proper permissions exist for writing to log files.
mapreduce.tasktracker.group (value: mapred)
    Group that the TaskTracker belongs to.
banned.users (value: mapred, hdfs, and bin)
    Users who should be prevented from running MapReduce.
min.user.id (value: 1000)
    User ID above which MapReduce tasks are allowed to run.
Here’s a sample task-controller.cfg:
hadoop.log.dir=/var/log/hadoop-0.20-mapreduce/
mapred.local.dir=/opt/hadoop/hdfs/mapred/local
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=500
Please note that the value for min.user.id may change depending on the operating system. Some of the
operating systems use a value of 0 instead of 500.
For YARN, you need to define container-executor.cfg with the configurations in Table 4-6.
Table 4-6. YARN container-executor.cfg Configurations

yarn.nodemanager.log-dirs (value: /var/log/yarn)
    Hadoop log directory (will vary per your Hadoop distribution). This location is used to make sure that proper permissions exist for writing to log files.
yarn.nodemanager.linux-container-executor.group (value: yarn)
    Group that the container belongs to.
banned.users (value: hdfs, yarn, mapred, and bin)
    Users who should be prevented from running MapReduce.
min.user.id (value: 1000)
    User ID above which MapReduce tasks are allowed to run.
As a last step, you have to set the following variables on all DataNodes in the file /etc/default/hadoop-hdfs-datanode.
These variables provide the necessary information to Jsvc (a set of libraries and applications for making Java applications
run on Unix more easily) so that it can run the DataNode in secure mode.
export HADOOP_SECURE_DN_USER=hdfs
export HADOOP_SECURE_DN_PID_DIR=/var/lib/hadoop-hdfs
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop-hdfs
export JSVC_HOME=/usr/lib/bigtop-utils/
If the directory /usr/lib/bigtop-utils doesn't exist, set the JSVC_HOME variable to /usr/libexec/bigtop-utils
instead, as follows:
export JSVC_HOME=/usr/libexec/bigtop-utils
So, finally, having installed, configured, and implemented Kerberos and modified various Hadoop configuration
files (with Kerberos implementation information), you are ready to start NameNode and DataNode services with
authentication!
Starting Hadoop Services with Authentication
Start the NameNode first. Execute the following command as root and substitute the correct path (to where your
Hadoop startup scripts are located):
su -l hdfs -c "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode"
After the NameNode starts, you can see Kerberos-related messages in the NameNode log file indicating successful
authentication (for the hdfs and HTTP principals) using keytab files:
2013-12-10 14:47:22,605 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(844)) - Login successful for user hdfs/pract_hdp_sec@EXAMPLE.COM using keytab file /etc/hadoop/conf/hdfs.keytab
2013-12-10 14:47:24,288 INFO server.KerberosAuthenticationHandler (KerberosAuthenticationHandler.java:init(185)) - Login using keytab /etc/hadoop/conf/hdfs.keytab, for principal HTTP/pract_hdp_sec@EXAMPLE.COM
Now start the DataNode. Execute the following command as root and substitute the correct path (to where your
Hadoop startup scripts are located):
su -l hdfs -c "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec &&
/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode"
After the DataNode starts, you can see the following Kerberos-related messages in the DataNode log file,
indicating successful authentication (for the hdfs principal) using the keytab file:
2013-12-08 10:34:33,791 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(844)) - Login successful for user hdfs/pract_hdp_sec@EXAMPLE.COM using keytab file /etc/hadoop/conf/hdfs.keytab
2013-12-08 10:34:34,587 INFO http.HttpServer (HttpServer.java:addGlobalFilter(525)) - Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2013-12-08 10:34:35,502 INFO datanode.DataNode (BlockPoolManager.java:doRefreshNamenodes(193)) - Starting BPOfferServices for nameservices: <default>
2013-12-08 10:34:35,554 INFO datanode.DataNode (BPServiceActor.java:run(658)) - Block pool <registering> (storage id unknown) service to pract_hdp_sec/192.168.142.135:8020 starting to offer service
Last, start the SecondaryNameNode. Execute the following command as root and substitute the correct path
(to where your Hadoop startup scripts are located):
su -l hdfs -c "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec &&
/usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start secondarynamenode";
Congratulations, you have successfully “kerberized” HDFS services! You can now start MapReduce services as
well (you have already set up the necessary principals and configuration in MapReduce configuration files).
Please understand that the commands I have used in this section may vary with the version of the operating
system (and the Hadoop distribution). It is always best to consult your operating system and Hadoop distributor’s
manual in case of any errors or unexpected behavior.
Securing Client-Server Communications
With earlier Hadoop versions, when daemons (or services) communicated with each other, they didn't verify that the
other service was really what it claimed to be. So, it was easily possible to start a rogue TaskTracker to get access to data
blocks. Impersonating services could easily get access to sensitive data, destroy data, or bring the cluster down! Even
now, unless you have Kerberos installed and configured and also have the right communication protocols encrypted,
the situation is not very different. It is very important to secure inter-process communication for Hadoop. Just using
an authentication mechanism (like Kerberos) is not enough. You also have to secure all the means of communication
Hadoop uses to transfer data between its daemons as well as communication between clients and the Hadoop cluster.
Inter-node communication in Hadoop uses the RPC, TCP/IP, and HTTP protocols. Specifically, RPC (remote
procedure call) is used for communication between NameNode, JobTracker, DataNodes, and Hadoop clients. Also,
the actual reading and writing of file data between clients and DataNodes uses TCP/IP protocol, which is not secured
by default, leaving the communication open to attacks. Last, HTTP protocol is used for communication by web
consoles, for communication between NameNode/Secondary NameNode, and also for MapReduce shuffle data
transfers. This HTTP communication is also open to attacks unless secured.
Therefore, you must secure all these Hadoop communications in order to secure the data stored within a Hadoop
cluster. Your best option is to use encryption. Encrypted data can’t be used by malicious attackers unless they have
means of decrypting it. The method of encryption you employ depends on the protocol involved. To encrypt TCP/IP
communication, for example, an SASL wrapper is required on top of the Hadoop data transfer protocol to ensure secured
data transfer between the Hadoop client and DataNode. The current version of Hadoop allows network encryption (in
conjunction with Kerberos) by setting explicit values in configuration files core-site.xml and hdfs-site.xml. To secure
inter-process communications between Hadoop daemons, which use RPC protocol, you need to use SASL framework.
The next sections will take a closer look at encryption, starting with RPC-based communications.
Safe Inter-process Communication
Inter-process communication in Hadoop is achieved through RPC calls. That includes communication between a
Hadoop client and HDFS and also among Hadoop services (e.g., between JobTracker and TaskTrackers or NameNode
and DataNodes).
SASL (Simple Authentication and Security Layer) is the authentication framework that can be used to guarantee
that data exchanged between the client and servers is encrypted and not vulnerable to “man-in-the-middle” attacks
(please refer to Chapter 1 for details of this type of attack). SASL supports multiple authentication mechanisms (e.g.,
MD5-DIGEST, GSSAPI, SASL PLAIN, CRAM-MD5) that can be used for different contexts.
For example, if you are using Kerberos for authentication, then SASL uses a GSSAPI (Generic Security Service
Application Program Interface) mechanism to authenticate any communication between Hadoop clients and
Hadoop daemons. For a secure Hadoop client (authenticated using Kerberos) submitting jobs, delegation token
authentication is used, which is based on SASL MD5-DIGEST protocol. A client requests a token to NameNode and
passes on the received token to TaskTracker, and can use it for any subsequent communication with NameNode.
When you set the hadoop.rpc.protection property in Hadoop configuration file core-site.xml to privacy, the
data over RPC will be encrypted with symmetric keys. Here’s the XML:
<property>
<name>hadoop.rpc.protection</name>
<value>privacy</value>
<description>authentication, integrity & confidentiality guarantees that data exchanged between
client and server is encrypted
</description>
</property>
Encryption comes at a price, however. As mentioned in Table 4-1, setting hadoop.rpc.protection to privacy
means Hadoop performs integrity checks, encryption, and authentication, and all of this additional processing will
degrade performance.
Encrypting HTTP Communication
Hadoop uses HTTP communication for web consoles, communication between NameNode/Secondary NameNode,
and for MapReduce (shuffle data). For a MapReduce job, the data moves between the Mappers and the Reducers via
the HTTP protocol in a process called a shuffle. The Reducer initiates a connection to the Mapper, requesting data,
and acts as a SSL client. The steps for enabling HTTPS to encrypt shuffle traffic are detailed next.
Certificates are used to secure the communication that uses HTTP protocol. You can use the Java utility keytool
to create and store certificates. Certificates are stored within KeyStores (files) and contain keys (private key and
identity) or certificates (public keys and identity). For additional details about KeyStores, please refer to Chapter 8 and
Appendix C. A TrustStore file contains certificates from trusted sources and is used by the secure HTTP (https) clients.
Hadoop HttpServer uses the KeyStore files.
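For instance, a self-signed server certificate can be created and stored in a JKS KeyStore with keytool; a minimal sketch (the alias and KeyStore path are illustrative; see Appendix C for the full setup):

keytool -genkey -alias hadoopssl -keyalg RSA -keysize 2048 \
    -keystore /etc/hadoop/conf/keystore.jks -validity 365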
After you create the HTTPS certificates and distribute them to all the nodes, you can configure Hadoop for
HTTP encryption. Specifically, you need to configure SSL on the NameNode and all DataNodes by setting property
dfs.https.enable to true in the Hadoop configuration file hdfs-site.xml.
Most of the time, SSL is configured to authenticate the server only, a mode called one-way SSL. For one-way
SSL, you only need to configure the KeyStore on the NameNode (and each DataNode), using the properties shown in
Table 4-7. These parameters are set in the ssl-server.xml file on the NameNode and each of the DataNodes.
Table 4-7. SSL Properties to Encrypt HTTP Communication
ssl.server.keystore.type (default: jks)
    KeyStore file type.
ssl.server.keystore.location (default: NONE)
    KeyStore file location. The mapred user should own this file and have exclusive read access to it.
ssl.server.keystore.password (default: NONE)
    KeyStore file password.
ssl.server.truststore.type (default: jks)
    TrustStore file type.
ssl.server.truststore.location (default: NONE)
    TrustStore file location. The mapred user must be the file owner, with exclusive read access.
ssl.server.truststore.password (default: NONE)
    TrustStore file password.
ssl.server.truststore.reload.interval (default: 10000)
    TrustStore reload interval, in milliseconds.
You can also configure SSL to authenticate the client; this mode is called mutual authentication or two-way SSL. To
configure two-way SSL, set the property dfs.client.https.need-auth to true in the Hadoop configuration file
hdfs-site.xml (on the NameNode and each DataNode), in addition to setting the property dfs.https.enable to true.
Appendix C has details of setting up KeyStore and TrustStore to use for HTTP encryption.
To configure an encrypted shuffle, you need to set the properties listed in Table 4-8 in the core-site.xml
files of all nodes in the cluster.
Table 4-8. core-site.xml Properties for Enabling Encrypted Shuffle (for MapReduce)
hadoop.ssl.enabled (value: true)
    For MRv1, setting this value to true enables both the Encrypted Shuffle and the Encrypted Web UI features. For MRv2, this property enables only the Encrypted Web UI; Encrypted Shuffle is enabled with a property in the mapred-site.xml file, as described in "Encrypting HTTP Communication."
hadoop.ssl.require.client.cert (value: true)
    When set to true, client certificates are required for all shuffle operations and for all browsers used to access Web UIs.
hadoop.ssl.hostname.verifier (value: DEFAULT)
    The hostname verifier to provide for HttpsURLConnections. Valid values are DEFAULT, STRICT, STRICT_IE6, DEFAULT_AND_LOCALHOST, and ALLOW_ALL.
hadoop.ssl.keystores.factory.class (value: org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory)
    The KeyStoresFactory implementation to use.
hadoop.ssl.server.conf (value: ssl-server.xml)
    Resource file from which SSL server KeyStore information is extracted. This file is looked up in the classpath; typically it should be in the /etc/hadoop/conf/ directory.
hadoop.ssl.client.conf (value: ssl-client.xml)
    Resource file from which SSL client KeyStore information is extracted. This file is looked up in the classpath; typically it should be in the /etc/hadoop/conf/ directory.
To enable Encrypted Shuffle for MRv2, set the property mapreduce.shuffle.ssl.enabled in the mapred-site.xml
file to true on every node in the cluster.
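In mapred-site.xml, that entry is simply:

<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>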
To summarize, for configuring Encrypted Shuffle (for MapReduce jobs) and Encrypted Web UIs, the following
configuration files need to be used/modified:
core-site.xml/hdfs-site.xml: for enabling HTTP encryption and defining implementation
mapred-site.xml: enabling Encrypted Shuffle for MRv2
ssl-server.xml: storing KeyStore and TrustStore settings for server
ssl-client.xml: storing KeyStore and TrustStore settings for the client
Securing Data Communication
Data transfer (read/write) between clients and DataNodes uses the Hadoop Data Transfer Protocol. Because the
SASL framework is not used here for authentication, a SASL handshake or wrapper is required if this data transfer
needs to be secured or encrypted. This wrapper can be enabled by setting the property dfs.encrypt.data.transfer to
true in the configuration file hdfs-site.xml. When the SASL wrapper is enabled, a data encryption key is generated
by the NameNode and communicated to the DataNodes and the client. The client uses the key as a credential for any
subsequent communication, and the NameNode and DataNodes use it to verify the client's communication.
If you have a preference regarding the actual algorithm you want to use for encryption, you can specify it
using the property dfs.encrypt.data.transfer.algorithm. The possible values are 3des or rc4 (the default is usually
3DES). 3DES, or "triple DES," is a variation of the popular symmetric key algorithm DES that uses three keys (instead
of the single key DES uses) to add strength to the protocol. You encrypt with one key, decrypt with the second, and
encrypt with a third. This process gives a strength equivalent to a 112-bit key (instead of DES's 56-bit key) and makes
the encryption stronger, but it is slow (due to the multiple iterations for encryption). Please refer to Chapter 8 for additional
details on the DES protocol. RC4 is another symmetric key algorithm that performs encryption much faster than
3DES, but it is potentially unsafe (Microsoft and Cisco are both phasing out this algorithm and have issued clear guidelines
to their users to avoid any usage of it).
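In hdfs-site.xml, the two properties together look like this (3des shown; substitute rc4 only if you accept the risks just described):

<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <value>3des</value>
</property>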
Please note that since the RPC protocol is used to send the data encryption keys to the clients, it is necessary to
set hadoop.rpc.protection to privacy in the configuration file core-site.xml (for both client and
server) to ensure that the transfer of the keys themselves is encrypted and secure.
Summary
In this chapter you learned how to establish overall security or a “fence” for your Hadoop cluster, starting with the
client. Currently, PuTTY offers the best open source options for securing your client. I discussed using a key pair
and a passphrase instead of the familiar login/password alternative. The reason is simple: to make it harder for
malicious attackers to break through your security. Everyone has used PuTTY, but many people don't think about
the underlying technology and the reasons for using some of the available options. I have tried to shed some light on those
aspects of PuTTY.
I am not sure if MIT had Hadoop in mind when it developed Kerberos, but the current usage of Kerberos with
Hadoop might make you think otherwise! Again, it is (by far) the most popular alternative for Hadoop authentication.
Dealing with KeyStores and TrustStores is always a little harder for non-Java personnel. If you need another
example, Appendix C will help further your understanding of those concepts.
The use of SASL protocol for RPC encryption and the underlying technology for encrypting data transfer protocol
are complex topics. This chapter’s example of implementing a secure cluster was merely intended to introduce the topic.
Where do you go from here? Is the job finished now that the outer perimeter of your cluster is secure? Certainly
not! This is where it begins: the next step is to secure your cluster further by specifying the finer details of authorization.
That's the subject of the next chapter.
CHAPTER 5
Implementing Granular Authorization
Designing fine-grained authorization reminds me of a story of a renowned bank manager who was very disturbed
by a robbery attempt made on his safe deposit vault. The bank manager was so perturbed that he immediately
implemented multiple layers of security and passwords for the vault. The next day, a customer request required that
he open the vault. The manager, in all his excitement, forgot the combination, and the vault had to be forced open
(legally, of course).
As you may gather, designing fine-grained security is a tricky proposition. Too much security can be as
counterproductive as too little. There is no magic to getting it just right. If you analyze all your processes (both manual
and automated) and classify your data well, you can determine who needs access to which specific resources and
what level of access is required. That’s the definition of fine-grained authorization: every user has the correct level of
access to necessary resources. Fine-tuning Hadoop security to allow access driven by functional need will make your
Hadoop cluster less vulnerable to hackers and unauthorized access—without sacrificing usability.
In this chapter, you will learn how to determine security needs (based on application) and then examine ways to
design high-level security and fine-grained authorization for applications, using directory and file-level permissions.
To illustrate, I’ll walk you through a modified real-world example involving traffic ticket data and access to that data
by police, the courts, and reporting agencies. The chapter wraps up with a discussion of implementing fine-grained
authorization using Apache Sentry, revisiting the traffic ticket example to highlight Sentry usage with Hive, a database
that works with HDFS. By the end of this chapter, you will have a good understanding of how to design fine-grained
authorization.
Designing User Authorization
Defining the details of fine-grained authorization is a multistep process. Those steps are:
Analyze your environment,
Classify data for access,
Determine who needs access to what data,
Determine the level of necessary access, and
Implement your designed security model.
The following sections work through this complete process to define fine-grained authorization for a real-world
scenario.
Call the Cops: A Real-World Security Example
I did some work for the Chicago Police Department a few years back involving the department’s ticketing system.
The system essentially has three parts: mobile consoles in police cars, a local database at the local police station,
and a central database at police headquarters in downtown Chicago. Why is fine-tuned authorization important in
this scenario? Consider the potential for abuse without it: if the IT department has modification permissions for the
data, for example, someone with a vested interest could modify data for a particular ticket. The original system was
developed using Microsoft SQL Server, but for my example, I will redesign it for a Hadoop environment. Along the
way, you’ll also learn how a Hadoop implementation is different from a relational database–based implementation.
Analyze and Classify Data
The first step is inspecting and analyzing the system (or application) involved. In addition, reviewing the high-level
objective and use cases for the system helps clarify access needs. Don’t forget maintenance, backup, and disaster
recovery when considering use cases. A system overview is a good starting point, as is reviewing the manual processes
involved (in their logical order). In both cases, your goals are to understand the functional requirements within each
process, to understand how processes interact with each other, to determine what data is generated within each
process and to track how that data is communicated to the next process. Figure 5-1 illustrates the analysis of a system.
Figure 5-1. Analyzing a system or an application
In my example, the first process is the generation of ticket data by a police officer (who issues the ticket). That
data gets transferred to the database at a local police station, and obviously needs to have modification rights for the
ticketing officer, his or her supervisor at the station, and of course upper management at police headquarters.
Other police officers at the local station need read permissions for this data, as they might want to have a look at
all the tickets issued on a particular day or at a person’s driving history while deciding whether to issue a ticket or only
a warning. Thus, a police officer looks up the ticket data (using the driver’s Social Security number, or SSN) at the local
police station database (for the current day) as well as at the central database located at police headquarters.
As a second process, the ticket data from local police stations (from all over the Chicago area) gets transmitted to
the central database at police headquarters on a nightly basis.
The third and final process is automated generation of daily reports every night for supervisors at all police
stations. These reports summarize the day’s ticketing activity and are run by a reporting user (created by IT).
Details of Ticket Data
This ticket data is not a single entity, but rather a group of related entities that hold all the data. Understanding the
design of the database holding this data will help in designing a detailed level of security.
Two tables, or files in Hadoop terms, hold all the ticket data. Just as tables are used to store data in a relational
database, HDFS uses files. In this case, assume Hadoop stored the data as a comma-delimited text file. (Of course,
Hadoop supports a wide range of formats to store the data, but a simple example facilitates better understanding of
the concepts.) The table and file details are summarized in Figure 5-2.
Driver_details file in Hadoop:

394-22-4567,Smith,John,203 Main street,Itasca,IL,8471234567
296-33-5563,Doe,Jane,1303 Pine street,Lombard,IL,6304561230
322-42-8765,Norberg,Scott,203 Main street,Lisle,IL,6304712345

Ticket_details file in Hadoop:

113322,394-22-4567,Speeding 20 miles over limit,Robert Stevens
234765,296-33-5563,Uturn where prohibited,Mark Spencer
245676,322-42-8765,Crossed Red light,Chris Cross

In Hadoop, the data in these two files is not "related"; it's just text data in two files. If the same data were
stored in relational tables, it would have "keys" (values making a row unique) and predefined "relations" between
tables: the Driver_details table (Social Sec Num (PK), Last Name, First Name, Address line1, City, State, Phone
number) and the Ticket_details table (Ticket Id (PK), Driver SSN (FK), Offense, IssuingOfficer), where a driver may
receive one or many tickets, so the Ticket_details data would be "related" with the Driver_details data.
Figure 5-2. Details of ticket data: classification of information and storage in tables versus files
The first table, Driver_details, holds all the personal details of the driver: full legal name, SSN, current address,
phone number, and so on. The second table, Ticket_details, has details of the ticket: ticket number, driver’s SSN,
offense code, issuing officer, and so forth.
Also, these tables are "related" to each other. The relational notation indicates that every driver (featured
in Driver_details) may have one or more tickets to his name, the details of which are in Ticket_details. How can a
ticket be related to a driver? By using the SSN. The SSN is a primary key (indicated as PK) or unique identifier for the
Driver_details table because an SSN identifies a driver uniquely. Since a driver may have multiple tickets, however,
the SSN is not a unique identifier for the Ticket_details table and is only used to relate the tickets to a driver
(indicated as FK or foreign key).
Please understand that the ticket data example is simplistic and just demonstrates how granular permissions can
be used. In addition, it makes these assumptions:
The example uses Hadoop 2.x, since we will need to append data to all our data files and
earlier versions didn't support appends. All the day's tickets from local police stations will be
appended every night to appropriate data files located at police headquarters (see the sketch following this list).
Records won’t be updated, but a status flag will be used to indicate the active record
(the most recent record being flagged active and earlier ones inactive).
There is no concurrent modification to records.
There are no failures while writing data that will compromise the integrity of the system.
The functional need is that only the police officer (who issues the ticket), that officer’s
supervisor, and higher management should have rights to modify a ticket—but this desired
granularity is not possible with HDFS! Hive or another NoSQL database needs to be used for
that purpose. For now, I have just provided modification rights to all police officers. In the next
section, however, you will learn how to reimplement this example using Hive and Sentry to
achieve the desired granularity.
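As a side note on the append assumption in the first item above, the nightly transfer could be as simple as appending each station's daily extract to the headquarters file. The local extract file name below is hypothetical; this is just a sketch of the mechanism:

hdfs dfs -appendToFile /tmp/ticket_details_2014-09-19.csv /Ticket_details

The appendToFile option adds the local file's contents to the end of the existing HDFS file, which is exactly the append capability Hadoop 2.x provides.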
A production system would likely be much more complex and need a more involved design for effective security.
Getting back to our example, how do we design roles for securing the ticket data and providing access based on
need? Our design must satisfy all of the functional needs (for all processes within the system) without providing too
much access (due to sensitivity of data). The next part of this section explains how.
Determine Access Groups and their Access Levels
Based on the functional requirements of the three processes, read and write (modify) access permissions are required
for the ticket data. The next question is, what groups require which permissions (Figure 5-3)? Three subgroups need
partial read and write access to this data; call them Group 1:
Ticket-issuing police officer
Local police supervisor
Higher management at headquarters
Figure 5-3. Group access to ticket data with detailed access permissions
Group 2, the IT department at police headquarters, needs read access. Figure 5-3 illustrates this access.
Table 5-1 lists the permissions.
Table 5-1. Permission Details for Groups and Entities (Tables)
Table | Group 1 | Group 2
Driver_details | Read/write | Read
Ticket_details | Read/write | Read
So, to summarize, analyze and classify your data, then determine the logical groups that need access to
appropriate parts of the data. With those insights in hand, you can design roles for defining fine-grained authorization
and determine the groups of permissions that are needed for these roles.
Logical design (even a very high-level example like the ticketing system) has to result in a physical
implementation. Only then can you have a working system. The next section focuses on details of implementing the
example’s design.
Implement the Security Model
Implementing a security model is a multistep process. Once you have a good understanding of the roles and their
permissions needs, you can begin. These are the steps to follow:
Secure data storage.
Create users and groups as necessary.
Assign ownerships, groups and permissions.
Understanding a few basic facts about Hadoop file permissions will help you in this process.
Ticket Data Storage in Hadoop
For the ticketing system example, I will start with implementation of data storage within HDFS. As you saw earlier,
data in Hadoop is stored in the files Driver_details and Ticket_details. These files are located within the root data
directory of Hadoop, as shown in Figure 5-4. To better understand the figure, consider some basic facts about HDFS
file permissions.

Figure 5-4. HDFS directory and file permissions
HDFS files have three sets of permissions, those for owner, group, and others. The permissions
are specified using a ten-character string, such as -rwxrwxrwx.
The first character indicates directory or file (- for file or d for directory), the next three
characters indicate permissions for the file’s owner, the next three for the owner’s group, and
last three for other groups.
Possible values for any grouping are r (read permission), w (write permission), x (permission
to execute), or - (permission not granted). Note that x is valid only for executable files or directories.
In Figure 5-4, the owner of the files Driver_details and Ticket_details (root) has rw- permissions, meaning
read and write permissions. The next three characters are permissions for group (meaning all the users who belong
to the group this file is owned by, in this case hdfs). The permissions for group are rw-, indicating all group members
have read and write permissions for this file. The last three characters indicate permissions for others (users who
don’t own the file and are not a part of the same group this file is owned by). For this example, others have read
permissions only (r--).
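For example, a listing of the example's data files might look like the following sketch (the sizes and timestamps are illustrative):

hdfs dfs -ls /
-rw-rw-r--   1 root hdfs        186 2014-09-19 18:40 /Driver_details
-rw-rw-r--   1 root hdfs        128 2014-09-19 18:40 /Ticket_details

Reading the permission string -rw-rw-r-- from left to right: these are files (-), the owner (root) and the group (hdfs) can read and write, and everyone else can only read.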
Adding Hadoop Users and Groups to Implement File Permissions
As a final step in implementing basic authorization for this system, I need to define appropriate users and groups
within Hadoop and adjust file permissions.
First, I create groups for this server corresponding to the example’s two groups: Group 1 is called POfficers and
Group 2 is ITD.
[root@sandbox ~]# groupadd POfficers
[root@sandbox ~]# groupadd ITD
Listing and verifying the groups is a good idea:
[root@sandbox ~]# cut -d: -f1 /etc/group | grep POfficers
I also create user Puser for group POfficers and user Iuser for group ITD:
[root]# useradd Puser -g POfficers
[root]# useradd Iuser -g ITD
Next, I set up passwords:
[root]# passwd Puser
Changing password for user Puser.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
Now, as a final step, I allocate owners and groups to implement the permissions. As you can see in Figure 5-5,
owners for the files Driver_details and Ticket_details are changed to the dummy user Puser, and group
permissions are set to write; so users from group POfficers (all police officers) will have read/write permissions and
users from other groups (viz. IT department) will have read permission only.

Figure 5-5. Changing owner and group for HDFS files
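The commands behind Figure 5-5 would look something like the following (a sketch, run as the HDFS superuser; 664 corresponds to the -rw-rw-r-- permission string discussed earlier):

sudo -u hdfs hdfs dfs -chown Puser:POfficers /Driver_details /Ticket_details
sudo -u hdfs hdfs dfs -chmod 664 /Driver_details /Ticket_details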
Comparing Table 5-1 to the final permissions for all the entities (same-named files in HDFS), you will see that
the objective has been achieved: Puser owns the files Driver_details and Ticket_details and belongs to group
POfficers (Group 1). The permissions -rw-rw-r-- indicate that any one from Group 1 has read/write permissions,
while users belonging to any other group (e.g., Group 2) only have read permissions.
This example gave you a basic idea about fine-tuning authorization for your Hadoop cluster. Unfortunately,
the real world is complex, and so are the systems we have to work with! So, to make things a little more real-world,
I’ll extend the example, and you can see what happens next to the ticket.
Extending Ticket Data
Tickets only originate with the police. Eventually, the courts get involved to further process the ticket. Thus, some
of the ticket data needs to be shared with the judicial system. This group needs read as well as modification rights on
certain parts of the data, but only after the ticket is processed through traffic court. In addition, certain parts of the
ticket data need to be shared with reporting agencies who provide this data to insurance companies, credit bureaus,
and other national entities as required.
These assumptions won’t change the basic groups, but will require two new ones: one for the judiciary (Group 3)
and another for reporting agencies (Group 4). Now the permissions structure looks like Figure 5-6.
Figure 5-6. Group access to ticket data with detailed access permissions, showing new groups
With the added functionality and groups, data will have to be added as well. For example, the table
Judgement_details will contain the ticket’s judicial history, such as case date, final judgment, ticket payment details,
and more. Hadoop will store this table in a file by the same name (Figure 5-7).
Judgement_details file stored in Hadoop:

23333,113322,394-22-4567,09/02/2014 9:00 AM,Robert Kennedy,Guilty,$200
12355,234765,296-33-5563,09/10/2014 11:00 AM,John Kennedy,Not Guilty,$0
58585,245676,322-42-8765,09/11/2014 10:00 AM,Mark Smith,Not Guilty,$0

If the data were stored in relational tables, the new Judgment_details table (Case Id (PK), Ticket Id (FK),
Driver SSN (FK), Case Date, Judge, Judgment, TPayment details) would be "related" with the Ticket_details table
(a ticket may result in a court case), just as the Ticket_details data would be "related" with the Driver_details
data (a driver may receive one or many tickets).
Figure 5-7. Details of ticket data—classification of information—with added table for legal details
Like Figure 5-2, Figure 5-7 also illustrates how data would be held in tables if a relational database was used
for storage. This is just to compare data storage in Hadoop with data storage in a relational database system. As I
discussed earlier, data stored within a relational database system is related: driver data (the Driver_details table)
is related to ticket data (the Ticket_details table) using SSN to relate or link the data. With the additional table
(Judgement_details), court judgment for a ticket is again related or linked with driver and ticket details using SSN.
Hadoop, as you know, uses files for data storage. So, as far as Hadoop is concerned, there is one additional data
file for storing data related to judiciary—Judgement_details. There is no concept of relating or linking data stored
within multiple files. You can, of course, link the data programmatically, but Hadoop doesn’t do that automatically for
you. It is important to understand this difference when you store data in HDFS.
The addition of a table will change the permissions structure as well, as you can see in Table 5-2.
Table 5-2. Permission Details for Groups and Entities
Entity (Table) | Group 1 | Group 2 | Group 3 | Group 4 |
Driver_details | Read/write | Read | Read | No access |
Ticket_details | Read/write | Read | Read | No access |
Judgement_details | Read | Read | Read/write | Read |
Adding new groups increases the permutations of possible permissions, but isn't helpful in addressing complex
permission needs (please refer to the section "Role-Based Authorization with Apache Sentry" to learn about
implementing granular permissions). For example, what if the police department wanted only the ticket-issuing
officer and the station superintendent to have write permission for a ticket? The groups defined in Figure 5-6 and
Table 5-2 clearly could not be used to implement this requirement. For such complex needs, Hadoop provides access
control lists (ACLs), which are very similar to the ACLs used by Unix and Linux.
Access Control Lists for HDFS
As per the HDFS permission model, for any file access request HDFS enforces permissions for the most specific user
class applicable. For example, if the requester is the file owner, then owner class permissions are checked. If the
requester is a member of the group owning the file, then group class permissions are checked. If the requester is
neither the file owner nor a member of the file owner's group, then others class permissions are checked.
This permission model works well for most situations, but not all. For instance, if all police officers, the manager
of the IT department, and the system analyst responsible for managing the ticketing system need write permission
to the Ticket_details and Driver_details files, the four existing groups would not be sufficient to implement
these security requirements. You could create a new owner group called Ticket_modifiers, but keeping the group’s
membership up to date could be problematic due to personnel turnover (people changing jobs), as well as wrong or
inadequate permissions caused by manual errors or oversights.
Used for restricting access to data, ACLs provide a very good alternative in such situations where your permission
needs are complex and specific. Because HDFS uses the same (POSIX-based) permission model as Linux, HDFS
ACLs are modeled after POSIX ACLs that Unix and Linux have used for a long time. ACLs are available in Apache
Hadoop 2.4.0 as well as all the other major vendor distributions.
You can use the HDFS ACLs to define file permissions for specific users or groups in addition to the file’s owner
and group. ACL usage for a file does result in additional memory usage for NameNode, however, so your best practice
is to reserve ACLs for exceptional circumstances and use individual and group ownerships for regular security
implementation.
To use ACLs, you must first enable them on the NameNode by adding the following configuration property to
hdfs-site.xml and restarting the NameNode:
<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>
Once you enable ACLs, two new commands are added to the HDFS CLI (command line interface): setfacl and
getfacl. The setfacl command assigns permissions. With it, I can set up write and read permissions for the ticketing
example’s IT Manager (ITMgr) and Analyst (ITAnalyst):
sudo -u hdfs hdfs dfs -setfacl -m user:ITMgr:rw- /Driver_details
sudo -u hdfs hdfs dfs -setfacl -m user:ITAnalyst:rw- /Driver_details
With getfacl, I can verify the permissions:
hdfs dfs -getfacl /Driver_details
# file:/Driver_details
# owner: Puser
# group: POfficers
user::rw-
user:ITAnalyst:rw-
user:ITMgr:rw-
group::r--
mask::rw-
other::r--
Once a file has an ACL, its listing shows a + at the end of the permissions string:
hdfs dfs -ls /Driver_details
-rw-rw-r--+ 1 Puser POfficers 19 2014-09-19 18:42 /Driver_details
You might have situations where specific permissions need to be applied to all the files in a directory or to all
the subdirectories and files for a directory. In such cases, you can specify a default ACL for a directory, which will be
automatically applied to all the newly created child files and subdirectories within that directory:
sudo -u hdfs hdfs dfs -setfacl -m default:group:POfficers:rwx /user
Verifying the permissions shows the default settings were applied:
hdfs dfs -getfacl /user
# file:/user
# owner: hdfs
# group: hdfs
user::rwx
group::r-x
other::r-x
default:user::rwx
default:group::r-x
default:group:POfficers:rwx
default:mask::rwx
default:other::r-x
Note that in our simple example I left rw- access for all users from group POfficers, so the ACLs really do not
restrict anything. In a real-world application, I would most likely have restricted the group POfficers to have less
access (probably just read access) than the approved ACL-defined users.
Be aware that HDFS applies the default ACL only to newly created subdirectories or files; applying a default
ACL to a parent directory, or subsequently changing it, does not automatically change the ACLs of
existing subdirectories or files.
You can also use ACLs to block access to a directory or a file for a specific user without accidentally revoking
permissions for any other users. Suppose an analyst has been transferred to another department and therefore should
no longer have access to ticketing information:
sudo -u hdfs hdfs dfs -setfacl -m user:ITAnalyst:--- /Driver_details
Verify the changes:
hdfs dfs -getfacl /Driver_details
# file:/Driver_details
# owner: Puser
# group: POfficers
user::rw-
user:ITAnalyst:---
user:ITMgr:rw-
group::r--
mask::rw-
other::r--
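Setting the permissions to --- masks the analyst's access but leaves the ACL entry in place. If you prefer to remove the entry altogether, or to strip all ACL entries from a file, setfacl supports the -x and -b flags, respectively (a sketch using the same example file):

sudo -u hdfs hdfs dfs -setfacl -x user:ITAnalyst /Driver_details
sudo -u hdfs hdfs dfs -setfacl -b /Driver_details

Note that -b removes all ACL entries except the base owner, group, and others entries, so use it only when you intend to fall back to plain file permissions.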
The key to effectively using ACLs is to understand the order of evaluation for ACL entries when a user accesses a
HDFS file. The permissions are evaluated and enforced in the following order:
If the user owns the file, then the owner permissions are enforced.
If the user has an ACL entry, then those permissions are enforced.
If the user is a member of the file's group, then those permissions are used.
If there is an ACL entry for a group and the user is a member of that group, then those
permissions are used.
If the user is a member of the file's group or of any other group whose ACL entry denies access to
the file, then the user is denied access. If the user is a member of multiple groups,
then the union of permissions for all matching entries is enforced.
Last, if no other permissions are applicable, then the permissions for the others class are used.
To summarize, HDFS ACLs are useful for implementing complex permission needs or to provide permissions
to a specific user or group different from the file ownership. Remember, however, to use ACLs judiciously, because
files with ACLs result in higher memory usage for NameNode. If you do plan to use ACLs, make sure to take this into
account when sizing your NameNode memory.
Role-Based Authorization with Apache Sentry
Sentry is an application that provides role-based authorization for data stored in HDFS and was developed and
committed by Cloudera to the Apache Hadoop community. It provides granular authorization that’s very similar to that
of a relational database. As of this writing, Sentry is the most mature open source product that offers RBAC (role-based
access control) for data stored within HDFS, although another project committed by Hortonworks (Argus) is a
challenger. Sentry currently works in conjunction with Hive (database/data warehouse made available by the Apache
Software Foundation) and Impala (query engine developed by Cloudera and inspired by Google’s Dremel).
Hive Architecture and Authorization Issues
Hive is a database that works with HDFS. Its query language is syntactically very similar to SQL and is one of the
reasons for its popularity. The main aspects of the database to remember are the following:
Hive structures data into familiar database concepts such as tables, rows, columns,
and partitions.
Hive supports primitive data types: integers, floats, doubles, and strings.
Hive tables are HDFS directories (and files within).
Partitions (for a Hive table) are subdirectories within the “table” HDFS directory.
Hive privileges exist at the database or table level and can be granted to a user, group, or role.
Hive privileges are select (read), update (modify data), and alter (modify metadata).
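To make the table-to-directory mapping concrete, a partitioned Hive table might map to HDFS paths like the following (a sketch; the database, table, and partition names are hypothetical):

hdfs dfs -ls /user/hive/warehouse/db1.db/ticket_details
drwxrwx---   - hive hive          0 2014-09-19 18:40 /user/hive/warehouse/db1.db/ticket_details/issue_date=2014-09-18
drwxrwx---   - hive hive          0 2014-09-19 18:42 /user/hive/warehouse/db1.db/ticket_details/issue_date=2014-09-19

Each partition subdirectory simply holds the data files for that partition.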
Hive isn’t perfect, however. It uses a repository (Metastore) for storing metadata related to tables, so you can
potentially have a mismatch between table metadata and HDFS permissions if permissions for underlying HDFS
objects are changed directly. Hive doesn’t have the capability to prevent or identify such a situation. Therefore, it’s
possible that a user is granted select permissions on a table, but has update or write permissions on the corresponding
directory/files within HDFS, through the user’s operating system user or group. Also, Hive has no way of providing
permissions for specific parts of table data or partial table data. There is no way to provide column-level permissions,
define views (for finer data access control), or define server level roles.
Sentry addresses some of these issues. It provides roles at the server, database, and table level and can work with
Hive external tables—which you can use for partial data access control for users.
Figure 5-8 illustrates Hive’s architecture and where it fits in respect to HDFS.
The figure shows Hive clients (the Hive shell, JDBC/ODBC applications, and a web interface) connecting through the
Thrift server to the Driver and the Metastore, with databases, tables, and views mapping onto HDFS storage, and
individual users holding privileges such as select or update on specific tables.
Figure 5-8. Hive architecture and its authorization
Sentry Architecture
A security module that integrates with Hive and Impala, Sentry offers advanced authorization controls that enable
more secure access to HDFS data. We will focus on Sentry integration with Hive (since it is used more extensively).
Sentry uses rules to specify precise permissions for a database object and roles to combine or consolidate the rules,
thereby making it easy to group permissions for different database objects while offering flexibility to create rules for
various types of permissions (such as select or insert).
Creating Rules and Roles for Sentry
Sentry gives you precise control over user access to subsets of data within a database, schema, or table by using
rules. For example, if a database db1 has a table called Employee, then a rule providing access for Insert can be:
server=MyServer->db=db1->table=Employee->action=Insert
A role is a set of rules to access Hive objects. Individual rules are comma separated and grouped to form a role.
For example, the Employee_Maint role can be specified as:
Employee_Maint = server=MyServer->db=db1->table=Employee->action=Insert, \
server=MyServer->db=db1->table=Employee_Dept->action=Insert, \
server=MyServer->db=db1->table=Employee_salary->action=Insert
Here, the Employee_Maint role enables any user (who has the role) to insert rows within tables Employee,
Employee_Dept, and Employee_salary.
Role-based authorization simplifies managing permissions since administrators can create templates for
groupings of privileges based on functional roles within their organizations.
Multidepartment administration empowers central administrators to deputize individual administrators
to manage security settings for each separate database or schema using database-level roles. For example, in
the following code, the DB2_Admin role authorizes all permissions for database db2 and Svr_Admin authorizes all
permissions for server MyServer:
DB2_Admin = server=MyServer->db=db2
Svr_Admin = server=MyServer
Creating rules and roles within Sentry is only the first step. Roles need to be assigned to users and groups if you
want to use them. How does Sentry identify users and groups? The next section explains this.
Understanding Users and Groups within Sentry
A user is someone authenticated by the authentication subsystem and permitted to access the Hive service. Because
the example assumes Kerberos is being used, a user will be a Kerberos principal. A group is a set of one or more users
that have been granted one or more authorization roles. Sentry currently supports HDFS-backed groups and locally
configured groups (in the configuration file policy.xml). For example, consider the following entry in policy.xml:
Supervisor = Employee_Maint, DB2_Admin
If Supervisor is an HDFS-backed group, then all the users belonging to this group can execute any HiveQL
statements permitted by the roles Employee_Maint and DB2_Admin. However, if Supervisor is a local group, then the users
belonging to this group (call them ARoberts and MHolding) have to be defined in the file policy.xml:
[users]
ARoberts = Supervisor
MHolding = Supervisor
Figure 5-9 demonstrates where Sentry fits in the Hadoop architecture with Kerberos authentication.
In that architecture, a client (the Hive shell, or a JDBC/ODBC/Thrift application) requests access using its
Kerberos credentials. Kerberos provides authentication at the operating system and HDFS level, while Sentry provides
granular authorization: it checks whether the underlying user has access, compares the user's role-based access with
the HiveQL statements requested, and executes the statements only if the access checks pass.
Figure 5-9. Hadoop authorization with Sentry
To summarize, after reviewing Hive and Sentry architectures, you gained an understanding of the scope of
security that each offers. You had a brief look at setting up rules, roles, users, and groups. So, you are now ready to
reimplement the ticketing system (using Sentry) defined in the earlier sections of this chapter.
Implementing Roles
Before reimplementing the ticketing system with the appropriate rules, roles, users, and groups, take a moment to
review its functional requirements. A ticket is created by the police officer who issues the ticket. Ticket data is stored
in a database at a local police station and needs to have modification rights for all police officers. The IT department
located at police headquarters needs read permission on this data for reporting purposes. Some of the ticket data is
shared with the judicial system, which needs read as well as modification rights to parts of the data, because data is
modified after a ticket is processed through traffic court. Last, certain parts of this data need to be shared with reporting
agencies that provide this data to insurance companies, credit bureaus, and other national agencies as required.
Table 5-3 summarizes the requirements; for additional detail, consult Figure 5-7.
Table 5-3. Permission Details for Groups and Entities
Entity (Table) | Police Officers | IT Department | Judiciary | Reporting Agencies |
Driver_details | Read/write | Read | Read | No access |
Ticket_details | Read/write | Read | Read | No access |
Judgement_details | Read | Read | Read/write | Read |
The original implementation using HDFS file permissions was easy but did not consider the following issues:
When a ticket gets created, a judiciary record (a case) is created automatically with the
parent ticket_id (indicating what ticket this case is for) and case details. The police officer
should have rights to insert this record in the Judgement_details table with ticket details, but
shouldn’t be allowed to modify columns for judgment and other case details. File permissions
aren’t flexible enough to implement this requirement.
The judge (assigned for a case) should have modification rights for columns with case details,
but shouldn’t have modification rights to columns with ticket details. Again, file permissions
can’t handle this.
To implement these requirements, you need Sentry (or its equivalent). Then, using Hive, you need to create
external tables with relevant columns (the columns where judiciary staff or police officers need write access) and
provide write access for the appropriate departments to those external tables instead of Ticket_details and
Judgement_details tables.
For this example, assume that the cluster (used for implementation) is running CDH4.3.0 (Cloudera Hadoop
distribution 4.3.0) or later and has HiveServer2 with Kerberos authentication installed.
As a first step, you need to make a few configuration changes. Change ownership of the Hive warehouse directory
(/user/hive/warehouse or any other path specified as value for property hive.metastore.warehouse.dir in Hive
configuration file hive-site.xml) to the user hive and group hive. Set permissions on the warehouse directory as
770 (rwxrwx---), meaning read, write, and execute permissions for owner and group; but no permissions for others
or users not belonging to the group hive. You can set the property hive.warehouse.subdir.inherit.perms to true
in hive-site.xml, to make sure that permissions on the subdirectories will be set to 770 as well. Next, change the
property hive.server2.enable.doAs to false. This will execute all queries as the user running service Hiveserver2.
Last, set the property min.user.id to 0 in configuration file taskcontroller.cfg. This is to ensure that the hive user
can submit MapReduce jobs.
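The ownership and permission changes amount to something like the following (a sketch that assumes the default warehouse path):

sudo -u hdfs hdfs dfs -chown -R hive:hive /user/hive/warehouse
sudo -u hdfs hdfs dfs -chmod -R 770 /user/hive/warehouse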
Having made these configuration changes, you’re ready to design the necessary tables, rules, roles, users,
and groups.
Designing Tables
You will need to create the tables Driver_details, Ticket_details, and Judgement_details, as well as an external
table, Judgement_details_PO, as follows:
CREATE TABLE Driver_details (SocialSecNum STRING,
LastName STRING,
FirstName STRING,
Address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
Phone BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/Driver_details";
CREATE TABLE Ticket_details (TicketId BIGINT,
DriverSSN STRING,
Offense STRING,
IssuingOfficer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/Ticket_details";
CREATE TABLE Judgement_details (CaseID BIGINT,
TicketId BIGINT,
DriverSSN STRING,
CaseDate STRING,
Judge STRING,
Judgement STRING,
TPaymentDetails STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/Judgement_details";
CREATE EXTERNAL TABLE Judgement_details_PO (CaseID BIGINT,
TicketId BIGINT,
DriverSSN STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/Judgement_details";
If you refer to Figure 5-7, you will observe that I am using the same columns (as we have in the Hadoop files or tables)
to create these tables, just substituting data types as necessary (e.g., the Last Name is a character string, or
data type STRING, while TicketId is a big integer, or BIGINT). The last table, Judgement_details_PO, is created as a Hive
external table, meaning Hive only manages metadata for this table and not the actual data file. I created this
external table with the first three columns of the table Judgement_details because I need certain resources to have
permissions to modify those columns only—not the other columns in that table.
Designing Rules
I need to design rules to provide the security required to implement the ticketing system. The example has four tables,
and various roles are going to need Read (Select) or Modify (Insert) rights, because there are no “updates” for Hive or
HDFS data. I will simply append (or Insert) the new version of a record. So, here are the rules:
server=MyServer->db=db1->table=Driver_details->action=Insert
server=MyServer->db=db1->table=Ticket_details->action=Insert
server=MyServer->db=db1->table=Judgement_details->action=Insert
server=MyServer->db=db1->table=Judgement_details_PO->action=Insert
server=MyServer->db=db1->table=Driver_details->action=Select
server=MyServer->db=db1->table=Ticket_details->action=Select
server=MyServer->db=db1->table=Judgement_details->action=Select
These rules simply allow Select or Insert actions on the appropriate tables.
Designing Roles
Let’s design roles using the rules we created. The first role is for all police officers:
PO_role = server=MyServer->db=db1->table=Driver_details->action=Insert, \
server=MyServer->db=db1->table=Driver_details->action=Select, \
server=MyServer->db=db1->table=Ticket_details->action=Insert, \
server=MyServer->db=db1->table=Ticket_details->action=Select, \
server=MyServer->db=db1->table=Judgement_details->action=Select, \
server=MyServer->db=db1->table=Judgement_details_PO->action=Insert
Notice that this role allows all the police officers to have read/write permissions to tables Driver_details and
Ticket_details but only read permission to Judgement_details. The reason is that police officers shouldn’t have
permission to change the details of judgment. You will also observe that police officers have write permission to
Judgement_details_PO; that is so they can correct the first three columns (which don't hold any judicial information)
in case there is an error!
The next role is for employees working at the IT department:
IT_role = server=MyServer->db=db1->table=Driver_details->action=Select, \
server=MyServer->db=db1->table=Ticket_details->action=Select, \
server=MyServer->db=db1->table=Judgement_details->action=Select
The IT employees have only read permissions on all the tables because they are not allowed to modify any data.
The role for Judiciary is as follows:
JU_role = server=MyServer->db=db1->table=Judgement_details->action=Insert, \
server=MyServer->db=db1->table=Driver_details->action=Select, \
server=MyServer->db=db1->table=Ticket_details->action=Select
The judiciary has read permissions for driver and ticket data (because they are not supposed to modify it) but
write permission to enter the judicial data because only they are allowed to modify it.
Last, for the Reporting agencies the role is simple:
RP_role = server=MyServer->db=db1->table=Judgement_details->action=Select
The Reporting agencies have read permissions on the Judgement_details table only because they are allowed to
report the judgement. All other data is confidential and they don’t have any permissions on it.
Setting Up Configuration Files
I have to set up the various configuration files for Sentry to incorporate the roles set up earlier. The
first file, sentry-provider.ini, defines the per-database policy files (with their locations), any server-level or
database-level roles, and the Hadoop groups with their assigned (server-level or database-level) roles. Here's how
sentry-provider.ini looks for our example:
[databases]
# Defines the location of the per DB policy file for the db1 DB/schema
db1 = hdfs://Master:8020/etc/sentry/db1.ini
[groups]
# Assigns each Hadoop group to its set of roles
db1_admin = db1_admin_role
admin = admin_role
[roles]
# Implies everything on MyServer -> db1. Privileges for
# db1 can be defined in the global policy file even though
# db1 has its own policy file. Note that the privileges from
# both the global policy file and the per-DB policy file
# are merged. There is no overriding.
db1_admin_role = server=MyServer->db=db1
# Implies everything on MyServer
admin_role = server=MyServer
In the example's case, a specific policy file for database db1 (db1.ini) is defined with its location. Administrator
roles for the server and for database db1 are defined (admin_role, db1_admin_role), and the appropriate Hadoop groups
(db1_admin, admin) are assigned to those administrator roles.
The next file is db1.ini. It is the per-database policy file for database db1:
[groups]
POfficers = PO_role
ITD = IT_role
Judiciary = JU_role
Reporting = RP_role
[roles]
PO_role = server=MyServer->db=db1->table=Driver_details->action=Insert, \
server=MyServer->db=db1->table=Driver_details->action=Select, \
server=MyServer->db=db1->table=Ticket_details->action=Insert, \
server=MyServer->db=db1->table=Ticket_details->action=Select, \
server=MyServer->db=db1->table=Judgement_details->action=Select, \
server=MyServer->db=db1->table=Judgement_details_PO->action=Insert
IT_role = server=MyServer->db=db1->table=Driver_details->action=Select, \
server=MyServer->db=db1->table=Ticket_details->action=Select, \
server=MyServer->db=db1->table=Judgement_details->action=Select
JU_role = server=MyServer->db=db1->table=Judgement_details->action=Insert, \
server=MyServer->db=db1->table=Driver_details->action=Select, \
server=MyServer->db=db1->table=Ticket_details->action=Select
RP_role = server=MyServer->db=db1->table=Judgement_details->action=Select
Notice that I have defined all the roles (designed earlier) in the roles section, and the groups section maps
Hadoop groups to the defined roles. I previously set up the Hadoop groups POfficers and ITD; I will need to set up
two additional groups (Judiciary and Reporting) because I mapped roles to them in the db1.ini file.
The last step is setting up the Sentry configuration file, sentry-site.xml:
<configuration>
<property>
<name>hive.sentry.provider</name>
<value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
</property>
<property>
<name>hive.sentry.provider.resource</name>
<value>hdfs://Master:8020/etc/sentry/sentry-provider.ini</value>
</property>
<property>
<name>hive.sentry.server</name>
<value>MyServer</value>
</property>
</configuration>
Last, to enable Sentry, we need to add the following properties to hive-site.xml:
<property>
<name>hive.server2.session.hook</name>
<value>org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook</value>
</property>
<property>
<name>hive.sentry.conf.url</name>
<value>hdfs://Master:8020/etc/sentry-site.xml</value>
</property>
This concludes reimplementation of the ticketing system example using Apache Sentry. It was possible to
specify the correct level of authorization for our ticketing system because Sentry allows us to define rules and roles
that limit access to data as necessary. Without this flexibility, either too much access would be assigned or no access
would be possible.
Summary
One of the few applications that offers role-based authorization for Hadoop data, Sentry is a relatively new release
and still in its nascent state. Even so, it offers a good start in implementing role-based security, albeit nowhere close to
the type of security an established relational database technology offers. True, Sentry has a long way to go in offering
anything comparable to Oracle or Microsoft SQL Server, but currently it’s one of the few options available. That’s also
the reason why the best practice is to supplement Sentry capabilities with some of Hive’s features!
You can use Hive to supplement and extend Sentry’s functionality. For example, in the ticketing example, I used
the external table feature of Hive to create a role that provided write permission on only some columns of the table.
Sentry by itself is not capable of offering partial write permission on a table, but you can use it in combination with
Hive to offer such a permission. I encourage you to study other useful Hive features and create your own roles that
can extend Sentry's functionality. The Apache documentation at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL provides many useful suggestions.
Last, the chapter's ticketing example proved that you can provide partial data access (a contiguous set of columns
starting from the first column) to a role by defining an external table in Hive. Interestingly, you can't provide
access to an arbitrary subset of columns (e.g., columns four to eight) of a table using Sentry alone. Of course, there
are other ways of implementing such a request using features that Hive provides!
PART III
Audit Logging and Security Monitoring
CHAPTER 6
Hadoop Logs: Relating and Interpretation
The other day, a very annoyed director of business intelligence (at a client site) stormed into my office and
complained about one of the contractors deleting some ledger records from a production server. She had received a
daily summary audit log report that showed 300 ledger records (financial transaction entries) had been deleted! To
start with, the contractor in question shouldn’t have had access to them. So I investigated, and it turned out that the
ERP (Enterprise resource planning) software that client was using had a bug that provided access through the “Public”
role. I wouldn’t have discovered the bug if I didn’t have audit logging enabled, which proves how important audit
logging can be from a security perspective.
The purpose of HDFS audit logging is to record all HDFS access activity within Hadoop. A MapReduce audit
log has entries for all jobs submitted. In addition, the Hadoop daemon log files contain startup messages, internal
diagnostic information, errors, informational or warning messages, configuration logs, and so forth. You can
filter the information that’s not required later, but it’s helpful to log all access, including authorized access. Even
authorized users can perform tasks for which they are not authorized. For example, a police officer might perform an
unauthorized update to his girlfriend’s ticket record without appropriate approvals. Besides, for audited applications
or any SOX-compliant applications, it is mandatory to audit all access to data objects (e.g., tables) within an
application, as well as to audit all job activity that changes any data within an audited application.
In this chapter, I will discuss how to enable auditing for Hadoop and how to capture auditing data. Log4j is at the
heart of Hadoop logging, be it audit logs or Hadoop daemon logs. I will begin with a high-level discussion of the Log4j
API and how to use it for audit logging, and then discuss the Log4j logging levels and their purpose. After an overview
of daemon logs and the information they capture, you will learn how to correlate auditing with Hadoop daemon logs
to implement security effectively.
Using Log4j API
A Java-based utility or framework, Apache Log4j was created by Ceki Gülcü and has since become a project of the
Apache Software Foundation. Logging is an essential part of any development cycle and in the absence of a debugger
(which is usually the case), it is the only tool for troubleshooting application code. It’s very important to use the
correct type of logging—one that’s reliable, fast, and flexible. Log4j fulfills these requirements:
Reliability is the expectation that relevant error or status messages are displayed without
any exceptions. Custom logging routines can be prone to bugs in that some of the messages
are not displayed due to faulty logic. Log4j doesn’t have that problem. This logging system is
well tested and has been popular for a long time. Reliability of the logging output logic can
certainly be guaranteed with Log4j.
Speed refers to the response time of the logging routine used. With Log4j, the Logger class is
instantiated (an instance created) as opposed to interacting with an interface, resulting in
a superfast response. Deciding what to log (based on logging level) only involves a decision
based on Logger hierarchy, which is fast. Outputting of a log message is fast due to use of
preformatting using Layouts and Appenders; typically, actual logging is about 100 to
300 microseconds. With SimpleLayout (the simplest Layout option for Log4j, explained in the
“Layout” section), Log4j can log as quickly as a print statement (which simply prints input text
to a console or a file)!
Flexibility refers to the ease of change to a logging system (without modifying the application
binaries that use it) and the ease of use for the application using the modified logging. For
example, with Log4j, you can direct output to two destinations, like the console and a log file,
using multiple logging destinations, which are also called Appenders. Simply modify the log4j.properties
configuration file to make this change; no code changes are needed.
The easiest way to include status or error messages is, of course, to insert them directly in your code. So, what’s
the advantage of using Log4j for logging as opposed to inserting comments in your application code or using a custom
logging module? Well, inserting comments and removing them is a tedious and time-consuming process that relies
on the expertise of the programmer—who might just forget to remove them after testing. Getting the percentage of
comments correct (sometimes too many, other times too few) is difficult, and selectively displaying those comments
is impossible. Also, any changes to comments involve recompilation of code. Last, a custom logging module may have
bugs or may not have as extensive functionality as Log4j API.
Via a configuration file, Log4j allows you to set logging behavior at runtime without modifying application
binaries. A major concern with logging is its impact on performance. Any logging by nature slows down an
application, but with Log4j the impact on performance is minuscule. For example, an independent test of the
latest release of Log4j 2 (Version 2) showed that it can output up to 18 million messages per second (for full
results, see Christian Grobmeier, "Log4j 2: Performance Close to Insane,"
www.javacodegeeks.com/2013/07/Log4j-2-performance-close-to-insane.html). With Log4j, the impact is limited to a range
from nanoseconds to microseconds, depending on your Log4j configuration, logging level, and Appenders.
The main components of Log4j logging framework are the Logger, Appender, Layout, and Filters. So you can
better understand how they work together, Figure 6-1 illustrates where they fit within the framework.
The figure shows the flow: an application requests a specific Logger; the Logger is created and assigned a level
(based on allocation or hierarchy); a context-wide filter evaluates each event before passing it to the Logger, which
applies its level for event filtering; output is then directed to one or more Appenders (with Logger and Appender
filters applied), each using a predefined Layout—for example, a PatternLayout writing to hdfs-audit.log, a
SimpleLayout writing to the console, or a DateLayout writing to mapred-audit.log. Note that these Appender and Layout
pairings are only examples; you can use any destination as an Appender and any Layout with it.
Figure 6-1. Log4j framework and its main components
The sections that follow will discuss each of these components in detail and provide information about what they
do and what their exact role is within the framework.
Loggers
A Logger is a named entity that is associated with a configuration (LoggerConfig) and subsequently with a logging
level. For Log4j logging to function, you need to have a root Logger with a related configuration defined. The root
Logger defines the default configuration (Appender, Layout, etc.). So what are these logging levels and how do they
correlate?
Logging Levels for Log4j
There are seven logging levels for Log4j API. They log information in order of severity and each of the levels is
inclusive of all higher levels. For example, log level INFO includes informational messages, warnings (higher-level WARN
included), nonfatal errors (higher-level ERROR included) and fatal errors (higher-level FATAL included). Similarly, log
level WARN includes warnings, nonfatal errors, and fatal errors. Figure 6-2 summarizes these inclusions.
Event Level | TRACE | DEBUG | INFO | WARN | ERROR | FATAL
TRACE | YES | NO | NO | NO | NO | NO
DEBUG | YES | YES | NO | NO | NO | NO
INFO | YES | YES | YES | NO | NO | NO
WARN | YES | YES | YES | NO | NO | NO
ERROR | YES | YES | YES | YES | YES | NO
FATAL | YES | YES | YES | YES | YES | YES
Figure 6-2. Log4j logging levels and inclusions. Rows are the levels of logged events; columns are the logging levels
you can configure for a Hadoop daemon or service (such as the NameNode or JobTracker). A YES means events of the
row's level appear in your log files when the daemon is configured at the column's level.
The seven log levels are as follows:
ALL: This is the lowest possible logging level, and it logs all messages including the higher
levels (e.g., fatal errors, nonfatal errors, informational messages, etc.)
TRACE: As the name suggests, this level logs finer-grained informational events than the
DEBUG level.
DEBUG: Logs fine-grained informational events that are most useful to debug an application.
INFO: Logs informational messages that highlight the progress of the application at a more
coarse-grained level.
WARN: Logs potentially harmful situations.
ERROR: Logs error events that might still allow the application to continue running.
FATAL: Logs very severe error events that will presumably lead the application to abort.
Please note that enabled TRACE and DEBUG levels can be considered a serious security flaw in production systems
and may be reported by vulnerability scanners as such. So, please use these log levels only when troubleshooting
issues and make sure that they are disabled immediately afterward.
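One way to keep such use temporary is the hadoop daemonlog command, which can inspect or change a daemon's log level at runtime (a sketch; the host, port, and class name are placeholders for your own NameNode):

hadoop daemonlog -getlevel namenode.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
hadoop daemonlog -setlevel namenode.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG

A level set this way is not persistent; it reverts when the daemon restarts, so permanent changes still belong in log4j.properties.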
Logger Inheritance
Logger names are case-sensitive and named hierarchically. A Logger is said to be an ancestor of another Logger if its
name followed by a dot is a prefix of the descendant Logger name. A Logger is said to be a parent of a child Logger if
there are no ancestors between it and the descendant Logger. So, for example, the Logger named L1.L2 is parent of the
Logger named L1.L2.L3. Also, L1 is parent of L1.L2 and ancestor (think grandparent) of L1.L2.L3. The root Logger is at
the top of the Logger hierarchy.
A Logger can be assigned a default log level. If a level is not assigned to a Logger, then it inherits one from its
closest ancestor with an assigned level. The inherited level for a given Logger L1 is equal to the first non-null level
in the Logger hierarchy, starting at L1 and proceeding upward in the hierarchy toward the root Logger. To make sure
that all Loggers inherit a level, the root Logger always has an assigned level. Figure 6-3 contains an example of level
inheritance.
Logger Name | Assigned level | Inherited level |
Root | Lroot | Lroot |
L1 | L1 | L1 |
L1.L2 | None | L1 |
L1.L2.L3 | L3 | L3 |
Figure 6-3. Logger level inheritance
As you can see, the root Logger, L1, and L1.L2.L3 have assigned logging levels. The Logger L1.L2 has no logging
level assigned to it and inherits the logging level L1 from its parent L1. A logging request is said to be enabled if its
level is higher than or equal to the level of its Logger. Otherwise, the request is disabled.
Most Hadoop distributions have five standard Loggers defined in log4j.properties in the /etc/hadoop/conf
or $HADOOP_INSTALL/hadoop/conf directory (Figure 6-4). For Log4j logging to function, a root Logger (with related
configuration) must be defined. The security Logger logs the security audit information. Audit Loggers log HDFS and
MapReduce auditing information, while a job summary Logger logs summarized information about MapReduce jobs.
Some distributions also have Loggers defined for Hadoop metrics, JobTracker, or TaskTracker.
Logger Name | Log4j property | Default log level |
Root logger | hadoop.root.logger | INFO |
Security logger | hadoop.security.logger | INFO |
HDFS Audit logger | hdfs.audit.logger | WARN |
MapReduce audit logger | mapred.audit.logger | WARN |
Job summary logger | hadoop.mapreduce.jobsummary.logger | INFO |
Figure 6-4. Loggers and default log levels
Figure 6-5 is a sample entry for HDFS audit Logger from log4j.properties.

Figure 6-5. HDFS Audit Logger
The maxfilesize setting is the critical size (here 256MB) after which the log file will “roll” and create a new log
file; maxbackupindex (20 in this case) is the number of backup copies of the log file to be created. In this example,
when the log file rolls over 21 times, the oldest file will be erased. Properties of other Loggers are specified in a similar
manner in the log4j.properties file.
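Figure 6-5 shows a properties-file entry of roughly the following form; this is a sketch based on common distribution defaults (the Appender name RFAAUDIT and the exact values shown are representative and may differ in your log4j.properties):
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
# Roll the log at 256MB and keep at most 20 backup copies
log4j.appender.RFAAUDIT.MaxFileSize=256MB
log4j.appender.RFAAUDIT.MaxBackupIndex=20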
Appenders
For the Log4j framework, an output destination is called an Appender. Currently, Appenders exist for the console,
files, GUI components, remote socket servers, JMS, NT Event Loggers, and remote UNIX Syslog daemons. In other
words, you can define any of these as your output destinations for logging. As of Log4j Version 2, you also can log
asynchronously, to pass the control back from the Logger to the application while I/O operations are performed in the
background by a separate thread or process. Asynchronous logging can improve your application’s performance.
Appender Additivity
Multiple Appenders can be attached to a Logger. Each enabled logging request for a given Logger will be forwarded
to all the Appenders in that Logger as well as the Appenders higher in the hierarchy. This is a default behavior known
as Appender additivity and can easily be disabled by setting the Additivity flag to false in the log4j.properties
configuration file.
Consider the example in Figure 6-6. If a console Appender is added to the root Logger, then all enabled logging
requests will display on the console. In addition, if file Appenders are added to the Loggers L1 and L1.L2.L3,
then logging requests for L1, L1.L2, and L1.L2.L3 will be written to the appropriate files and displayed on the console.
Now suppose you set Logger L4's Additivity flag to false. This effectively disconnects L4 and its children from the
upward propagation of log output. Because the parent of Logger L4.L5 (which is L4 in the example) has its Additivity
flag set to false, L4.L5's output will be directed only to the Appenders in L4.L5 (in this case none) and its ancestors up
to and including L4 (File4), but will not propagate upward to the root Logger and its console Appender. Figure 6-6 tabulates the results.
Logger | Appender | Additivity | Output | Comment |
Root | Console | not applicable | Console | There is no default Appender for root; additivity does not apply to it |
L1 | File1 | True | Console, File1 | Appenders of "L1" and root |
L1.L2 | None | True | Console, File1 | Appenders of "L1" and root |
L1.L2.L3 | File2 | True | Console, File1, File2 | Appenders in "L1.L2.L3", "L1", and root |
L4 | File4 | False | File4 | No appender accumulation, since the Additivity flag is set to false |
L4.L5 | None | True | File4 | Only appenders of L4, since the Additivity flag in "L4" is set to false |
Figure 6-6. Appender additivity for Log4j framework
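In log4j.properties, additivity is controlled per Logger. A minimal sketch for the L4 example (L4 and File4 are the placeholder names from the figure):
# Attach Appender File4 to Logger L4
log4j.logger.L4=INFO, File4
# Prevent L4 (and, transitively, L4.L5) from propagating output to the root Logger's Appenders
log4j.additivity.L4=false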
The Appenders frequently used by the major Hadoop distributions are:
Console Appender: Displays log messages on the console
File Appender: Writes log messages to a specific file, which you define in log4j.properties
Rolling file Appender: Writes log messages to files and rolls them based on size
Daily rolling file Appender: Writes log messages to files and rolls them on a daily basis
Using the same entry as for the HDFS Audit Logger (Figure 6-5), consider the Appender section presented in
Figure 6-7.

Figure 6-7. Rolling file Appender for HDFS Audit Logger
In Figure 6-7, I used the RollingFileAppender with HDFS audit Logger. The output is formatted as per the
Layout (PatternLayout) and the defined conversion pattern (I will discuss Layout and conversion patterns shortly),
and looks like this:
2014-02-09 16:00:00,683 INFO FSNamesystem.audit: allowed=true ugi=hdfs (auth:SIMPLE)
ip=/127.0.0.1 cmd=getfileinfo src=/user/sqoop2/.Trash/Current dst=null perm=null
Note HDFS audit output may result in a large file. Therefore, it is a good idea to have it roll off to a new file on a daily
basis or by size.
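For example, switching the audit Appender to a daily roll can be sketched as follows (DRFAAUDIT is a placeholder Appender name; DailyRollingFileAppender is the standard Log4j class for date-based rolling):
log4j.appender.DRFAAUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
# Roll at midnight; the date is appended to the name of the closed file
log4j.appender.DRFAAUDIT.DatePattern=.yyyy-MM-dd
log4j.appender.DRFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n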
Layout
A Layout is an output format for a log entry. It can be associated with an Appender and can format the logging request
as per your specifications before that request is delivered via an Appender.
It’s important to structure and present information in a way that makes reading and interpretation easy. Often it
is necessary to pass logging information to another error-processing program running on a remote machine. So, it is
important to decide on a structure for logging information. This is what the Layout objects provide.
Layouts use conversion patterns to format and output the log message. A conversion pattern consists of a format
modifier and conversion characters. For example, the conversion character %t outputs the name of the thread that
generated the logging event, and %5p (the format modifier 5 applied to the conversion character p) displays (or
writes) the log level using five characters, with space padding on the left. So, log level INFO is displayed
(or written) as " INFO".
A Layout can be specified for an Appender in the log4j.properties file. For example, I specified PatternLayout
as a layout (for our HDFS audit log Appender) in Figure 6-8.

Figure 6-8. PatternLayout for HDFS Audit Logger
The conversion pattern %d{ISO8601} %p %c{2}: %m%n from Figure 6-8 outputs as:
2014-01-27 20:34:55,508 INFO FSNamesystem.audit: allowed=true ugi=mapred (auth:SIMPLE)
ip=/127.0.0.1 cmd=setPermission src=/tmp/mapred/system/jobtracker.info dst=null
perm=mapred:supergroup:rw-------
The first field is the date/time in ISO8601 (YYYY-MM-DD HH:mm:ss,SSS) format. The second field is the level or
priority of the log statement. The third is the category, the fourth field is the message itself, and the fifth field is the line
separator (newline, or \n).
Apache Log4j offers several Layout objects:
Simple Layout: org.apache.log4j.SimpleLayout provides a very basic structure for the
logging message. It includes only the level of the logging information and the logging message
itself. This is how the log message for HDFS Audit Logger (from Figure 6-8) will be output if
Simple Layout is used instead of PatternLayout:
INFO allowed=true ugi=hdfs (auth:SIMPLE) ip=/127.0.0.1
cmd=getfileinfo src=/user/sqoop2/.Trash/Current dst=null perm=null
Thread-Time-Category-Context Layout (TTCCLayout): This Layout outputs the invoking
thread, time (in milliseconds since application started), the category or Logger used to create
this logging event, and nested diagnostic context. All these properties are optional and if they
are all disabled, the Layout will still write out the logging level and the message itself, just like
Simple Layout. If you specify the following options in log4j.properties:
#configuring the Appender CONSOLE
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.TTCCLayout
#configuring the Layout TTCCLayout
log4j.appender.CONSOLE.layout.ThreadPrinting=false
log4j.appender.CONSOLE.layout.ContextPrinting=false
log4j.appender.CONSOLE.layout.CategoryPrefixing=false
log4j.appender.CONSOLE.layout.DateFormat=ISO8601
You get the following output:
INFO allowed=true ugi=hdfs (auth:SIMPLE) ip=/127.0.0.1
cmd=getfileinfo src=/user/sqoop2/.Trash/Current dst=null perm=null
DateLayout: As the name suggests, this Layout provides date formats such as NULL (no date/
time displayed), RELATIVE (displays time elapsed after application start), DATE (dd MMM
YYYY HH:mm:ss,SSS pattern; the final SSS is milliseconds), ABSOLUTE
(HH:mm:ss,SSS pattern), and ISO8601 (yyyy-MM-dd HH:mm:ss,SSS pattern).
HTMLLayout: Your application might need to present log information in a nice, visually
appealing HTML-formatted file. org.apache.log4j.HTMLLayout is the relevant object. A big
advantage of having the log file in HTML format is that it can be published as a web page for
remote viewing.
XMLLayout: To render logging information in a portable (across multiple application
modules) format, Log4j provides the org.apache.log4j.xml.XMLLayout object. It is important
to note that the final output is not a well-formed XML file. This Layout object produces logging
information as a number of <log4j:event> elements.
PatternLayout: You can use this Layout to “format” or output log messages using a
consistent pattern to facilitate their use by an external entity. The relevant Layout object is
org.apache.log4j.PatternLayout. The formatting is specified by conversion characters
(e.g., %m writes the log message, %p writes the log level information) in a conversion pattern such
as %d{ISO8601} %p %c{2}: %m%n. The display (or write) width is specified by format
modifiers. For example, %10c instructs that the Logger name must be at least 10 characters; if
it's shorter, space padding is added on the left. Specifying %-10c indicates the space padding should be
added to the right. For more details on the PatternLayout class and conversion characters, see:
http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/PatternLayout.html.
Filters
Filters evaluate log events and either allow them to be published or not. There are several types of Filters, and they
screen out events based on such criteria as number of events (BurstFilter); a log-event message matching a regular
expression (RegexFilter); or the event ID, type, and message (StructuredDataFilter). The type of filter determines
where you need to specify it:
Context-wide Filters are configured as a part of the configuration (LoggerConfig) and evaluate
events before passing them to Loggers for further processing.
Logger Filters are configured for a Logger and are evaluated after the Context-wide Filters and
the log level for the Logger.
Appender Filters are configured for an Appender and determine if a specific Appender should
publish the event.
Appender Reference Filters are configured for a Logger and determine if a Logger should route
the event to an Appender.
Please note that all of these Filters need to be specified in the appropriate section (for a Logger or an Appender)
in your log4j.properties file. For example, Figure 6-9 shows a section from log4j.properties that defines a
RegexFilter to capture HDFS auditing events for login root only:

Figure 6-9. RegexFilter for HDFS Audit Logger
You can similarly use other types of Filters to prevent capture of unwanted events, which will help keep the size of
audit log small and make focusing on specific issues easier.
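If your Log4j version supports Filters in properties files (1.2.16 and later do), an effect comparable to the Figure 6-9 RegexFilter can be sketched with the bundled StringMatchFilter; RFAAUDIT is a placeholder for your audit Appender's name:
# Accept only audit events whose message mentions login root...
log4j.appender.RFAAUDIT.filter.1=org.apache.log4j.varia.StringMatchFilter
log4j.appender.RFAAUDIT.filter.1.StringToMatch=ugi=root
log4j.appender.RFAAUDIT.filter.1.AcceptOnMatch=true
# ...and drop everything else
log4j.appender.RFAAUDIT.filter.2=org.apache.log4j.varia.DenyAllFilter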
Reviewing Hadoop Audit Logs and Daemon Logs
As you’ve learned, you can use the Log4j component to generate log output for many purposes (e.g., debugging,
operational stats, auditing). The logging data Log4j outputs is, in turn, generated by system daemon processes, and a
particular type of data may exist in multiple places. How do you connect and analyze data from disjoint sources to get
the total view of system operations, history, and state? The key is Hadoop’s audit logs. This section will discuss which
daemon processes generate which data, what kind of data is captured by auditing, and how you can use Hadoop audit
logs for security purposes.
To get a complete system picture, you need to understand what kind of data is logged by Hadoop daemons or
processes (that generate logs) and where these log files reside. You also need to understand how the captured data
differs with configured logging level. The auditing data from HDFS, for example, doesn’t have details of jobs executed.
That data exists elsewhere, so connecting a job with HDFS access audits requires some work. You have to know where
logs for JobTracker, TaskTracker (MapReduce V1), and ResourceManager (MapReduce V2) are or where log data for
Task attempts is stored. You will need it for a complete audit of data access (who/what/where), and you certainly may
need it in case of a security breach.
A major issue with Hadoop auditing is that there is no direct or easy way to relate audit data with job data. For
example, JobTracker and TaskTracker logs (along with task attempt log data) can provide details of jobs executed
and all the statistics related to jobs. But how can you relate this data with audit data that only has details of all HDFS
access? You will learn a couple of possible ways later in this chapter.
Audit Logs
Auditing in Hadoop is implemented using the Log4j API, but is not enabled by default. Hadoop provides an HDFS audit
log that captures all access to HDFS and the MapReduce audit log, which captures information about all submitted jobs
for a Hadoop cluster. The location of audit logs is specified using the environment variable HADOOP_LOG_DIR defined
in the hadoop-env.sh configuration file located in $HADOOP_INSTALL/hadoop/conf directory ($HADOOP_INSTALL is
the directory where Hadoop is installed). The audit log file names are defined in the log4j.properties file, and the
defaults are hdfs-audit.log (for the HDFS audit log) and mapred-audit.log (for the MapReduce audit log). You can’t
define audit logging for YARN using log4j.properties yet; this is still being worked on (see “Add YARN Audit Logging
to log4j.properties,” https://issues.apache.org/jira/browse/HADOOP-8392).
To enable auditing, you need to modify the log4j.properties configuration file by changing the logging
level of the appropriate Logger from WARN to INFO. You'll find the file in the /etc/hadoop/conf directory or the
$HADOOP_INSTALL/hadoop/conf directory, where $HADOOP_INSTALL is the Hadoop installation directory.
log4j.properties defines the logging configuration for NameNode and the other Hadoop daemons (JobTracker,
TaskTracker, NodeManager, and ResourceManager). For example, to enable HDFS auditing, look for this line in the
log4j.properties file:
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN
Replace WARN with INFO to enable HDFS auditing and ensure that a log line is written to the HDFS audit log for every
HDFS event.
Likewise, to enable MapReduce auditing, set its Logger to the INFO level:
log4j.logger.org.apache.hadoop.mapred.AuditLogger=INFO
Figure 6-10 shows a section from log4j.properties defining the HDFS auditing configuration.

Figure 6-10. HDFS audit logging configuration
Hadoop Daemon Logs
Hadoop daemon logs are logs generated by Hadoop daemons (NameNode, DataNode, JobTracker, etc.) and located under
/var/log/hadoop; the actual directories may vary as per the Hadoop distribution used. The available logs are as follows:
NameNode logs (hadoop-hdfs-namenode-xxx.log) containing information about file opens
and creates, metadata operations such as renames, mkdir, and so forth.
DataNode logs (hadoop-hdfs-datanode-xxx.log) containing information about DataNode
access and modifications to data blocks.
Secondary NameNode logs (hadoop-hdfs-secondarynamenode-xxx.log) containing
information about application of edits to FSimage, new FSimage generation, and transfer to
NameNode.
JobTracker logs (hadoop-xxx-mapreduce1-jobtracker-xxx.log), containing information
about jobs executed. JobTracker creates an XML file (job_xxx_conf.xml) for every job that runs
on the cluster. The XML file contains the job configuration. In addition, JobTracker creates
runtime statistics for jobs. The statistics include task attempts, start times of tasks attempts,
and other information.
TaskTracker logs (hadoop-xxx-mapreduce1-tasktracker-xxx.log), containing information
about tasks executed. TaskTracker creates logs for task attempts that include standard error
logs, standard out logs, and Log4j logs.
ResourceManager (yarn-xxx-resourcemanager-xxx.log) and Job History server logs
(mapred-xxx-historyserver-xxx.log), containing information about job submissions, views,
or modifications. These are available only if you use MapReduce V2 or YARN.
As with audit logs, you can specify the logging level of the Hadoop daemons in the configuration file
log4j.properties, and each daemon can have a different level of logging if required. For example, you could set the
Audit Logger for HDFS to the INFO level and instruct TaskTracker to log at level TRACE:
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO
log4j.logger.org.apache.hadoop.mapred.TaskTracker=TRACE
Please note that other components (e.g., Hive, HBase, Pig, Oozie, etc.) have corresponding log4j.properties
files in their own configuration directories.
Any operational Hadoop cluster has a number of scheduled (and unscheduled or ad hoc) jobs executing at
various times, submitted by any of the approved users. As mentioned, it is challenging to correlate job logs with the
HDFS access logs captured via auditing. For example, consider this typical row found in audit records:
2013-10-07 08:17:53,438 INFO FSNamesystem.audit: allowed=true ugi=hdfs (auth:SIMPLE) ip=/127.0.0.1
cmd=setOwner src=/var/lib/hadoop-hdfs/cache/mapred/mapred/staging dst=null perm=mapred:supergroup:rwxrwxrwt
All this row says is that a command (setOwner in this case) was executed on a source file, but it doesn’t indicate if
it was executed as part of a job.
You would need to refer to the corresponding JobTracker or TaskTracker logs to see if there were any jobs
executing at that time, or else assume that it was an ad hoc operation performed using a Hadoop client. Therefore, you
need to maintain logs of other Hadoop daemons or processes in addition to audit logs and correlate them for effective
troubleshooting.
Correlating and Interpreting Log Files
Hadoop generates a lot of logs. There are audit logs and daemon logs that separately provide a lot of information
about the processing done at the sources from which they are gathered. However, they don’t form a cohesive,
complete picture of all the processing performed at your Hadoop cluster. That’s the reason you need to correlate these
logs while troubleshooting an issue or investigating a security breach.
Correlating Hadoop audit data with logs generated by Hadoop daemons is not straightforward and does require a
little effort, but the results are well worth it. Using a username or job number, along with Linux text-processing
utilities (e.g., the stream editor sed), you can relate the data and identify security breaches.
What to Correlate?
Hadoop daemons log a lot of useful information, and you can also enable and gather audit logs. Assuming you have
all these logs available, what should you correlate? Well, that depends on the event you are trying to investigate.
Consider a possible security breach in Chapter 3’s ticketing system example. As you remember, all the police
stations send their ticketing data nightly to the central repository at police headquarters. The central repository holds
the ticketing data in a Hive table that has partitions for each day. Every day, an IT professional runs an automated
process to add a new partition using the data received.
One day, one of the IT professionals decided to help out his girlfriend by removing her speeding ticket entry. He
was caught due to analysis conducted using correlated logs. He removed the ticket entry from the ticketing table, but
forgot to remove the corresponding entries from judiciary-related tables, and the system flagged errors when the case
was due for a hearing. Subsequently, a thorough investigation was conducted. Let’s follow the trail as it unfolded; the
unprofessional IT professional goes by the username RogueITGuy.
When the error was detected, the system administrator checked access to HDFS using the following:
HDFS audit log: This provided details of all commands users executed on a cluster. Because
Ticket_details was the table that was missing a record, investigators focused on it and filtered
out access by user root and HDFS superuser hdfs (since both are system users with controlled
passwords) to get a list of users who accessed Ticket_details. To filter, investigators (the team
including the system administrator) used the following shell command:
grep Ticket_details hdfs-audit.log | grep -v 'ugi=root' | grep -v 'ugi=hdfs'
(The -v option for the grep command excludes records that contain the pattern specified after the option.)
The results included normal user activity plus the following suspicious activity by a user
RogueITGuy:
2014-03-06 22:26:08,280 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy
(auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo
src=/Ticketing/Ticket_details_20140220
dst=null perm=null
2014-03-06 22:26:08,296 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy
(auth:SIMPLE) ip=/127.0.0.1 cmd=rename
src=/Ticketing/Ticket_details_20140220
dst=/Ticketing/Ticket_stg/Ticket_details_20140220
perm=RogueITGuy:supergroup:rw-r--r--
2014-03-06 22:27:02,666 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy
(auth:SIMPLE) ip=/127.0.0.1 cmd=open
src=/Ticketing/Ticket_stg/Ticket_details_20140220 dst=null
perm=null
Investigators concluded the following:
User RogueITGuy (ugi=RogueITGuy) loaded a new version of daily staging file
Ticket_details_20140220 (cmd=rename
src=/Ticketing/Ticket_details_20140220
dst=/Ticketing/Ticket_stg/Ticket_details_20140220).
The file was loaded to the HDFS location that points to the external staging table
Ticket_details_stg, which is used to load data to the Ticket_details table by creating and overwriting
the partition for a particular day.
The first entry (cmd=getfileinfo src=/Ticketing/Ticket_details_20140220) was to make
sure he had the correct (modified with his girlfriend’s ticket entry removed) file uploaded from
his PC.
The third entry was to make sure that the modified file was uploaded to the staging
location correctly.
Hive log: If this user overwrote a partition with the modified file, he would have done that
using Hive. So, investigators looked at the Hive logs next (in /var/log/hive for Cloudera
CDH4; may vary as per your distribution and configuration):
grep 'ugi=RogueITGuy' hadoop-cmf-hive1-HIVEMETASTORE-localhost.localdomain.log.out
| grep 'ticket_details' | grep -v 'get_partition'
They searched for activity by RogueITGuy in the table Ticket_details and, after reviewing the
output, filtered out 'get_partition' entries, since that command does not modify a partition.
Here’s what they saw:
2014-03-06 22:42:36,948 INFO
org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=RogueITGuy
ip=/127.0.0.1 cmd=source:/127.0.0.1 get_table : db=default tbl=ticket_details
2014-03-06 22:42:37,184 INFO
org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=RogueITGuy
ip=/127.0.0.1 cmd=source:/127.0.0.1 append_partition: db=default
tbl=ticket_details[2014,2,20]
Investigators drew the following conclusions:
The partition for 2/20/14 was overwritten (ugi=RogueITGuy ip=/127.0.0.1
cmd=source:/127.0.0.1 append_partition: db=default
tbl=ticket_details[2014,2,20]) for table Ticket_details by RogueITGuy.
The file Ticket_details_20140220 was uploaded on 3/6/14 22:26 and the Hive partition was
overwritten on 3/6/14 22:42 by the same user—RogueITGuy. Case closed!
Last, investigators checked the jobs submitted by RogueITGuy. Several job-related logs provided details of jobs
users executed. Investigators started by reviewing the MapReduce audit logs, which contain all the user, date/time, and
result details of submitted jobs. For Cloudera their location is /var/log/hadoop-0.20-mapreduce/mapred-audit.log.
Investigators next issued the following command:
grep 'RogueITGuy' mapred-audit.log
It yielded a couple of jobs:
2014-03-06 22:28:01,590 INFO mapred.AuditLogger: USER=RogueITGuy IP=127.0.0.1
OPERATION=SUBMIT_JOB TARGET=job_201403042158_0008 RESULT=SUCCESS
2014-03-06 22:42:07,415 INFO mapred.AuditLogger: USER=RogueITGuy IP=127.0.0.1
OPERATION=SUBMIT_JOB TARGET=job_201403042158_0009 RESULT=SUCCESS
2014-03-06 22:45:55,399 INFO mapred.AuditLogger: USER=RogueITGuy IP=127.0.0.1
OPERATION=SUBMIT_JOB TARGET=job_201403042158_0010 RESULT=SUCCESS
2014-03-06 22:47:39,380 INFO mapred.AuditLogger: USER=RogueITGuy IP=127.0.0.1
OPERATION=SUBMIT_JOB TARGET=job_201403042158_0011 RESULT=SUCCESS
2014-03-06 22:48:46,991 INFO mapred.AuditLogger: USER=RogueITGuy IP=127.0.0.1
OPERATION=SUBMIT_JOB TARGET=job_201403042158_0012 RESULT=SUCCESS
Investigators checked the JobTracker and TaskTracker logs using the web interface for JobTracker at
http://JobTrackerHost:50030/JobTracker.jsp. Jobs job_201403042158_0010, job_201403042158_0011, and
job_201403042158_0012 were Select statements that didn’t modify any data, but jobs job_201403042158_0008
and job_201403042158_0009 led to conclusive proof! Investigators reviewed the hive.query.string property in the
job.xml file for these jobs and retrieved the query that was executed, which was:
FROM Ticket_details_stg INSERT OVERWRITE TABLE Ticket_details PARTITION (Yr=2014,Mo=2,Dy=20) SELECT
TicketId,DriverSSN,Offense,IssuingOfficer
The query used the data from the Ticket_details_stg table (a daily staging table) to overwrite a partition
for date 2/20/14 for table Ticket_details. The HDFS audit logs already established that RogueITGuy had loaded a
temporary data file to staging table.
Together, the logs made clear that RogueITGuy edited the daily temporary data file and removed the record
that contained ticket entry for his girlfriend. Then he uploaded the new file to the staging table and used the staging
table to overwrite a partition for the Ticket_details table to make sure that the ticket entry was removed. Using
the HDFS audit log, Hive log, MapReduce audit log and job.xml files, investigators obtained conclusive evidence of
unauthorized activities performed by RogueITGuy and were able to successfully conclude the investigation.
As a result, RogueITGuy lost his job, and his girlfriend had to pay the ticket. She was so touched by his devotion,
however, that she agreed to marry him. So, in the end, even RogueITGuy thanked correlated logs!
How to Correlate Using Job Name?
There are several ways you can correlate the logs. The easiest is using login name or job name, because log messages
contain this information. You saw how the RogueITGuy username led to correlating the various log files to investigate
unauthorized activities. Relating the logs using job names was an important step, as well. To track down the security
breach, investigators had to extract relevant information from the logs and use job name to relate multiple logs to get
details of what activities were performed for a particular job.
I will walk you through that process now, starting with the MapReduce audit log (mapred-audit.log), which has
entries as shown in Figure 6-11.

Figure 6-11. MapReduce audit log
Notice the highlighted entry with the job name job_201403042158_0008. The HDFS audit log has multiple entries
for this job. How do you filter them out?
If you look at the first occurrence of an entry for this job (in hdfs-audit.log), you will observe that it has the
pattern cmd=getfileinfo along with job name job_201403042158_0008. This holds true for Cloudera’s Hadoop
distribution (CDH4); if you use a different distribution, you will need to identify a unique pattern for
the first and last occurrences of a particular job. The good news is that you only have to perform this exercise once for
a Hadoop distribution. You simply have to establish a unique pattern for the first and last occurrence of the job name
that separates it from subsequent occurrences; then you can use it for all your searches.
Subsequently, you can use the Linux utility awk to get the line number for first occurrence of this pattern:
awk '/cmd\=getfileinfo/ && /job_201403042158_0008\t/ { print NR }' hdfs-audit.log
The awk utility looks for the first line that matches the patterns cmd=getfileinfo and job_201403042158_0008
and uses the built-in variable NR to output line number.
Also, you can get the line number for the last occurrence of a job name by using the patterns cmd=delete and
src=/tmp/mapred/system/job_201403042158_0008, like:
awk '/cmd\=delete/ && /src=\/tmp\/mapred\/system\/job_201403042158_0008/ { print NR }'
hdfs-audit.log
After that, you can just use a stream editor, such as sed, to print lines starting with the first pattern and ending
with the second pattern. For example, sed -n 1,20p hdfs-audit.log will display lines 1 to 20 from file
hdfs-audit.log on the screen.
sed -n `awk '/cmd\=getfileinfo/ && /job_201403042158_0008\t/ { print NR }' hdfs-audit.log`,`awk
'/cmd\=delete/ && /src=\/tmp\/mapred\/system\/job_201403042158_0008/ { print NR }' hdfs-audit.log`p
hdfs-audit.log
The sed command uses the line numbers obtained in the earlier steps (returned by the two embedded awk
commands) as start and end to print all the lines in between. You can redirect the output of the sed command to a
file and review the HDFS audit records, instead of watching them on the screen (as implied by the last sed command).
You can use this sed command to extract job details from hdfs-audit.log for any job (for CDH4); just substitute the job name!
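To avoid retyping the backquoted commands, the whole extraction can be wrapped in a small shell script. This is a sketch (the script name and variables are mine) that assumes the CDH4 first/last-occurrence patterns described above:
#!/bin/bash
# job_audit.sh -- print all hdfs-audit.log lines belonging to one job
# Usage: ./job_audit.sh job_201403042158_0008 hdfs-audit.log
JOB="$1"
LOG="$2"
# Line number of the first occurrence: cmd=getfileinfo plus the job name
START=$(awk -v j="$JOB" '/cmd=getfileinfo/ && index($0, j) { print NR; exit }' "$LOG")
# Line number of the last occurrence: cmd=delete of the job's system directory
END=$(awk -v j="$JOB" '/cmd=delete/ && index($0, "src=/tmp/mapred/system/" j) { n = NR } END { print n }' "$LOG")
# Print everything in between
sed -n "${START},${END}p" "$LOG"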
Now, in this case, you didn’t get much information from hdfs-audit.log entries, except that this job did Hive-
related processing and also showed the location of job.xml:
2014-03-06 22:27:59,817 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy (auth:SIMPLE)
ip=/127.0.0.1 cmd=create src=/user/RogueITGuy/.staging/job_201403042158_0008/libjars/hive-
builtins-0.10.0-cdh4.4.0.jar dst=null perm=RogueITGuy:supergroup:rw-r--r--
2014-03-06 22:28:02,184 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy (auth:SIMPLE)
ip=/127.0.0.1 cmd=getfileinfo src=/user/RogueITGuy/.staging/job_201403042158_0008/job.xml
dst=null perm=null
2014-03-06 22:28:02,324 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy
(auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/tmp/hive-RogueITGuy/hive_2014-03-06_22-
27-55_562_981696949097457901-1/-mr-10004/164c8515-a032-4b6f-a551-9bc285ce37c4 dst=null
perm=null
Why not just use the grep command to retrieve the job details for job job_201403042158_0008 in hdfs-audit.log?
The reason is that not all the lines pertaining to job job_201403042158_0008 contain the job name pattern, and
you want to make sure you don't miss any relevant lines from the log file hdfs-audit.log.
Using Job Name to Retrieve Job Details
You can use the same technique of finding a unique pattern for the first occurrence to retrieve records relevant to a job
from the JobTracker or TaskTracker logs. For example, to look for a pattern in the JobTracker log file and get the line
number of the first occurrence of a job, such as job_201403042158_0008, use:
awk '/job_201403042158_0008/ && /nMaps/ && /nReduces/ { print NR }'
hadoop-cmf-mapreduce1-JOBTRACKER-localhost.localdomain.log.out
To retrieve the line number for the last occurrence of 'job_201403042158_0008', use:
awk '/job_201403042158_0008/ && /completed successfully/ { print NR }'
hadoop-cmf-mapreduce1-JOBTRACKER-localhost.localdomain.log.out
You can use command sed to get details from the JobTracker log file for CDH4 by specifying the job name.
For example, the sed command to print out all records for job_201403042158_0008 is:
sed -n `awk '/job_201403042158_0008/ && /nMaps/ && /nReduces/ { print NR }' hadoop-cmf-mapreduce1-
JOBTRACKER-localhost.localdomain.log.out`,`awk '/job_201403042158_0008/ && /completed successfully/
{ print NR }' hadoop-cmf-mapreduce1-JOBTRACKER-localhost.localdomain.log.out`p hadoop-cmf-
mapreduce1-JOBTRACKER-localhost.localdomain.log.out
The command’s output provides valuable details such as the nodes tasks were executed on or where the task
output is located:
2014-03-06 22:28:01,394 INFO org.apache.hadoop.mapred.JobInProgress: job_201403042158_0008: nMaps=1
nReduces=0 max=-1
2014-03-06 22:28:01,764 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job
job_201403042158_0008 =74. Number of splits = 1
2014-03-06 22:28:01,765 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201403042158_0008_m_000000 has split on node:/default/localhost.localdomain
2014-03-06 22:28:01,765 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201403042158_0008
initialized successfully with 1 map tasks and 0 reduce tasks.
2014-03-06 22:28:02,089 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP)
'attempt_201403042158_0008_m_000002_0' to tip task_201403042158_0008_m_000002, for tracker
'tracker_localhost.localdomain:localhost.localdomain/127.0.0.1:47799'
Using Web Browser to Retrieve Job Details
You can also review the JobTracker and TaskTracker log records easily using the browser interface. The runtime
statistics for a job or XML file for a job are best reviewed using the browser interface. The URL for the records is
composed of the tracker’s name and web access port. If your JobTracker host is called 'MyJobHost' and uses port
50030 for web access, for example, then the JobTracker logs can be reviewed at http://MyJobHost:50030/logs/.
Likewise, logs for a TaskTracker running on host 'MyTaskHost' and using port 50060 can be reviewed at
http://MyTaskHost:50060/logs/. Check your configuration file (mapred-site.xml) for particulars of hosts running
specific daemons and ports. Filenames may vary by distributions, but log files will have TaskTracker or JobTracker
in their names, making them easy to identify.
Figure 6-12 shows a logs directory and various MapReduce log files for a cluster using MapReduce 1.0.

Figure 6-12. MapReduce log files for MapReduce Version 1
If you are using YARN, then the corresponding daemons are ResourceManager (instead of JobTracker) and
NodeManager (instead of TaskTracker). Please check the YARN configuration file (yarn-site.xml) for web access
ports (values of mapreduce.jobhistory.webapp.address and yarn.resourcemanager.webapp.address). For example,
in Figure 6-13, the ResourceManager uses port 8088.

Figure 6-13. ResourceManager web interface for YARN
The NodeManager uses port 8042, as shown in Figure 6-14.

Figure 6-14. NodeManager web interface for YARN
Last, the HistoryServer uses port 19888 (Figure 6-15).

Figure 6-15. HistoryServer web interface for YARN
The YARN logs for NodeManager and ResourceManager should be used to get job details when YARN is used.
The HistoryServer holds logs for archived or "retired" jobs. So, if you need to access older job details, that's what you need to
check. The patterns to locate first and last lines may change slightly and might need to be adjusted, but you can easily
browse through the log files to make those adjustments. An easy way to find out the location of the YARN log files is to refer
to the log4j.properties file located in /etc/hadoop/conf and see where the appropriate Appenders are pointing.
A thought before I conclude the chapter. You have seen how to relate logs for a job, but what if you want to
trace all the activity for a user or you want to trace activity for a whole day? Defining and using awk patterns would
be cumbersome, difficult, and error-prone. Instead, try defining Log4j Filters for Appenders, as well as defining
additional Appenders to direct relevant output to separate files for an issue, and consolidate all the files for an issue.
You can either use Flume for that purpose or simply have your shell scripts do the consolidation for you.
Important Considerations for Logging
Some additional factors will help you make effective use of logging. Although they are not directly relevant to
security, I will mention them briefly in this section and you can decide how relevant they are for your individual
environments.
Time Synchronization
Hadoop is a distributed system with multiple nodes—often a large number of them. Therefore, Hadoop logs are also
distributed across the various nodes within your cluster. Individual log messages are timestamped, and while you are
troubleshooting, you need to be sure that 12:00 PM on one node is the same moment of time as specified by 12:00 PM
on another node.
For a network, clock skew is the time difference in the clocks for different nodes on the network. Usually, a time
difference in milliseconds is acceptable; but a larger clock skew needs to be avoided. A number of protocols
(e.g., Network Time Protocol, http://en.wikipedia.org/wiki/Network_Time_Protocol) can be used to make sure
that there is negligible time skew. It is certainly important to make sure that the generated logs for your cluster are
time synchronized.
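For example, assuming the standard NTP client tools are installed on your nodes, you can check a node's clock offset without changing its clock:
# Query an NTP server (query-only mode; does not set the clock)
ntpdate -q pool.ntp.org
# If ntpd is already running, list its peers and their current offsets
ntpq -p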
Hadoop Analytics
In the section "Correlating and Interpreting Log Files," I discussed how a combination of the Linux stream editor
sed and the powerful text processor awk can be used to search for a pattern and print the appropriate lines. You can
easily extend this method to counting the lines that match a pattern. You can make multiple passes on the log files
and aggregate the matches to analyze the usage patterns. Analytics so generated might not be useful for security
investigations, but they can certainly provide useful statistics for your Hadoop cluster.
For example, the following command can tell you how many times the user RogueITGuy accessed your Hadoop
cluster since it was started (you can of course easily extract the date range for access as well):
grep 'ugi=RogueITGuy' hdfs-audit.log | wc -l
The following command tells you how many jobs were executed by RogueITGuy since your cluster restarted:
grep 'USER=RogueITGuy' mapred-audit.log | wc -l
The following script extracts the start and end date/time for job job_201403042158_0008 (you can then compute
the job duration):
awk -F ',' '/cmd\=getfileinfo/ && /job_201403042158_0008\t/ { print $1 }' hdfs-audit.log
awk -F ',' '/cmd\=delete/ && /src=\/tmp\/mapred\/system\/job_201403042158_0008/ { print $1 }'
hdfs-audit.log
You can develop automated scripts that write all the daily job analysis or HDFS access analysis to files and add
them as partitions for appropriate Hive tables. You can then perform aggregations or use other statistical functions on
this data for your own analytical system.
Of course, the analytics that are more relevant to you may vary, but I am sure you understand the method
behind them.
This historical data (stored as Hive tables) can also be used for generating security alerts by defining variation
thresholds. For example, you can write a Hive query to generate an alert (through Nagios) if a user executes twice
(or more) the number of jobs as compared to his monthly average. The use of historical data for security alerts will
always rely on sudden change in usage, and you can use the concept as applicable to your environment.
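As a sketch of such an alert query, assume a hypothetical Hive table job_history (user_name STRING, job_date STRING) that your daily scripts populate from the MapReduce audit log; the query flags any user whose daily job count is at least double his 30-day average:
SELECT d.user_name, d.jobs_today, m.avg_daily_jobs
FROM (SELECT user_name, COUNT(*) AS jobs_today
      FROM job_history
      WHERE job_date = '2014-03-06'
      GROUP BY user_name) d
JOIN (SELECT user_name, COUNT(*) / 30 AS avg_daily_jobs
      FROM job_history
      WHERE job_date >= '2014-02-04' AND job_date < '2014-03-06'
      GROUP BY user_name) m
ON (d.user_name = m.user_name)
WHERE d.jobs_today >= 2 * m.avg_daily_jobs;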
Splunk
Splunk is a very powerful tool for analyzing Hadoop data. Using the Hadoop Connect module of Splunk, you can
import any HDFS data and use the indexing capability of Splunk for further searching, reporting, analysis, and
visualization for your data. You can also import Hadoop logs, index them, and analyze them.
Splunk provides a powerful search processing language (SPL) for searching and analyzing real-time and
historical data. It can also provide real-time monitoring of your log data (for patterns/thresholds) and
generate alerts when specific patterns occur within your data. For example, if you are using Hive, you may want to know
when a partition (for one of your Production tables) is overwritten or added. You might also want to alert your system
administrator when one of the users connects to your cluster.
Splunk's most important capability (from a security logging perspective) is its ability to correlate data. Splunk
supports the following ways to correlate data:
Time and geolocation based: You can correlate data for events that took place over a
specific date or time duration and at specific locations. So, if I had used Splunk to conduct the
investigation for RogueITGuy, I could have asked Splunk to give me all the log data for 3/6/14
(the specific date when the issue occurred).
Transaction based: You can correlate all the data for a business process (or series of business
processes) and identify it as a single event. Even though it can’t be used for security, it can
provide analytics for a job or a business process (such as duration, CPU and RAM resources
consumed, etc.).
Sub-searches: Allow you to take the results of one search and use them in another. So, if I had
used Splunk to conduct the investigation for RogueITGuy, then I could have defined sub-searches for
HDFS, MapReduce, or Hive access for easier analysis.
Lookups: Allow you to correlate data from external sources. For instance, I could have
checked all Hive alerts from Nagios to see if RogueITGuy was involved in any other issues.
Joins: Allow you to link two completely different data sets together based on a username
or event ID field. Using Splunk, I could link monitoring data from Ganglia and Hadoop log
data using username RogueITGuy and investigate what else he accessed while performing his
known illegal activities.
Last, Splunk offers Hunk, which is an analytics tool specifically designed for Hadoop and NoSQL Data. It lets you
explore, analyze, and visualize raw, unstructured data. Hunk also offers role-based access to limit access to sensitive
data (more information at www.splunk.com/hunk). Take a look and see if it is more useful for your needs!
Summary
In this chapter, I discussed how Hadoop logging can be effectively used for security purposes. The high-level approach
is to use Linux utilities and stream editors to process the text in log files and derive the necessary information, but this
is, of course, very old-fashioned and hard work. There are easier ways of achieving similar results by using third-party
solutions such as Splunk.
A large number of third-party products are available for reducing the work involved in troubleshooting
or investigating security breaches. The disadvantage is that you won’t have as much control or flexibility while
correlating or analyzing the logs. The preference is yours—and most of the time it's dictated by your environment
and your requirements. With either approach, be sure to synchronize time on all the nodes you need to consider
before you can rely on the logs generated.
Last, it is worthwhile to explore the use of Hadoop logs for analytics—be it security related or otherwise.
You can either buy expensive software to perform the analytics or develop your own scripts if you are sure of your
requirements—and if they are small in number!
CHAPTER 7
Monitoring in Hadoop
Monitoring, as any system administrator will tell you, is ideal for getting to the root of performance issues. Monitoring
can help you understand why a system is out of CPU or RAM resources, for example, and notify you when CPU or
RAM usage nears a specified percent. What your system administrator may not know (but you can explain after
reading this chapter) is that monitoring is equally well suited for ferreting out security issues.
Consider a scenario: You manage a Hadoop cluster (as system administrator) and are concerned about two
specific users: Bob, a confirmed hacker, and Steve, who loves to run queries that access volumes of data he is not
supposed to access! To stop password loss and avoid server crashes, you would like to be notified when Bob is trying
to read the /etc/password file and when Steve is running a huge query that retrieves the whole database. Hadoop
monitoring can provide the information you need. Specifically, Hadoop provides a number of Metrics to gain useful
security details, which the leading monitoring systems can use to alert you to trouble. In addition, these monitoring
systems let you define thresholds (for generating alerts) based on specific Metric values and also let you define
appropriate actions (in case thresholds are met). Thus, Hadoop monitoring offers many features you can use for
performance monitoring and troubleshooting.
In this chapter’s detailed overview of monitoring, I will discuss features that a monitoring system needs, with an
emphasis on monitoring distributed clusters. Thereafter, I will discuss the Hadoop Metrics you can use for security
purposes, and introduce Ganglia and Nagios, the two most popular monitoring applications for Hadoop. Last, I will
discuss some helpful plug-ins for Ganglia and Nagios that provide integration between the two programs, as well as
plug-ins that provide security-related functionality.
Overview of a Monitoring System
Monitoring a distributed system is always challenging. Not only are multiple processes interacting with users and
each other, but you must monitor the system without affecting the performance of those processes in any way.
A system like Hadoop presents an even greater challenge, because the monitoring software has to monitor individual
hosts and then consolidate that data in the context of the whole system. It also needs to consider the roles of various
components in context of the whole system. For example, the CPU usage on a DataNode is not as important as the
CPU usage on NameNode. So, how will the system process CPU consumption alerts or identify separate threshold
levels for hosts with different roles within the distributed system? Also, when considering CPU or storage usage
for DataNodes, the monitoring system must consider combined usage for all the DataNodes within a cluster.
Subsequently, the monitoring system needs the capability to summarize monitoring thresholds by role as well.
In addition to the complex resource monitoring capabilities, a monitoring system for distributed systems needs
to have access to details of processes executing at any time. This is necessary for generating alerts (e.g., a user process
resulting in 90% CPU usage) or performing any preventive action (e.g., a user is accessing critical system files).
Before you can effectively meet the challenges of monitoring a Hadoop system, you need to understand the
architecture of a simple monitoring system. In the next section, I’ll discuss the components, processing, and features
that you need for monitoring a distributed system effectively, as well as how this simple architecture can be adapted
to be better suited for monitoring a Hadoop cluster.
Simple Monitoring System
A simple monitoring system needs four key components: a server or coordinator process, connections to poll
distributed system hosts and gather the necessary information, a repository to store gathered information, and a
graphical user interface as a front-end (Figure 7-1).
[Figure 7-1 depicts a simple monitoring system: a monitoring server acts as a centralized processing hub, polls distributed system hosts A, B, and C over polling connections, writes to a repository, and feeds a console (a graphical user interface that displays consolidated monitoring output).]
Figure 7-1. Simple monitoring system
As you can see, the monitoring server consolidates input received by polling the distributed system hosts and
writes detailed (as well as summarized) output to a repository. A console provides display options for the gathered
data, which can be summarized using various parameters, such as monitoring event, server, type of alert, and so on.
Unfortunately, simple monitoring system architecture like this doesn’t scale well. Consider what would happen if
Figure 7-1’s system had to monitor thousands of hosts instead of three. The monitoring server would have to manage
polling a thousand connections, process and consolidate output, and present it on the console within a few seconds!
With every host added to the monitoring system, the load on the monitoring server will increase. After a certain
number of hosts, you won't be able to add any more, because the server simply won't be able to support them. Also,
the large volume of polling will add to network traffic and impact overall system performance.
Add to that the complexities of a Hadoop cluster where you need to consider a node’s role while consolidating
data for it, as well as summarizing data for multiple nodes with the same role. The simplistic design just won’t suffice,
but it can be adapted for monitoring a Hadoop cluster.
Monitoring System for Hadoop
A simple monitoring system follows the same processing arrangement as the traditional client-server design: a single,
centralized monitoring server does all the processing, and as the number of hosts increase, so does the processing
load. Network traffic also weighs down the load, as polled data from hosts consolidates on the monitoring server.
Just as Hadoop’s distributed architecture is a marked improvement in efficiency over traditional client-server
processing, a distributed processing model can improve a simple monitoring system as well. If a localized monitoring
process captures and stores monitoring data for each node in a Hadoop cluster, for example, there is no longer a
centralized server to become a processing bottleneck or a single point of failure. Every node is an active participant
performing part of the processing in parallel. Each of these localized processes can then transmit data to other nodes
in the cluster and also receive copies of data from other nodes in the cluster. A polling process can poll monitoring
data for the whole cluster from any of the nodes within the cluster at any predetermined frequency. The data can be
written to a repository and stored for further processing or displayed by a graphical or web based frontend. Figure 7-2
shows a possible design.
[Figure 7-2 depicts a monitoring system for Hadoop: Nodes A, B, and C each run a monitor process that computes local monitoring data, and the nodes transmit and receive monitoring data from each other. A polling process polls from a single node within the cluster (it can run on any node within the cluster or off it), writes to a repository, and a console displays consolidated monitoring output.]
Figure 7-2. Monitoring system for Hadoop
With this architecture, even adding 1000 hosts for monitoring would not adversely affect performance. No
additional load burdens any of the existing nodes or the polling process, because the polling process can still poll from
any of the nodes and doesn’t have to make multiple passes. The cluster nodes transmit data to a common channel
that is received by all other nodes. So, increasing the number of nodes does not impact polling process or system
performance in any way, making the architecture highly scalable. Compared to traditional monitoring systems, the
only extra bit of work that you need to do is to apply the monitoring process configuration to all the nodes.
Taking a closer look at Figure 7-2, notice that the monitoring processes on individual nodes compute "local
monitoring data." The monitoring data needs to be computed locally because Hadoop is a multi-node distributed
system where data is spread across its numerous DataNodes and, as per the Hadoop philosophy of "taking processing
to data," the data is processed locally (where it resides, on the DataNodes). This "local monitoring data" is actually Metric
output for individual nodes; it can tell you a lot about your system's security and performance, as you'll learn next.
Hadoop Metrics
Hadoop Metrics are simply information about what’s happening within your system, such as memory usage, number
of open connections, or remaining capacity on a node. You can configure every Hadoop daemon to collect Metrics
at a regular interval and then output the data using a plug-in. The collected data can contain information about
Hadoop daemons (e.g., the resources used by them), events (e.g., MapReduce job executions), and measurements
(e.g., number of files created for NameNode). The output plug-in you use determines the Metric's destination.
For example, FileContext writes the Metric to a file, GangliaContext passes the Metric on to the Ganglia
monitoring system for display and consolidation, and NullContext discards the Metric.
Depending on the information they contain, Metrics are classified into four contexts: jvm, dfs, rpc, and mapred. Metrics
for jvm contain basic statistics for JVM (Java Virtual Machine) such as memory usage or thread counts etc. This context is
applicable for all Hadoop daemons. The dfs (distributed file system) context is applicable to NameNode and DataNode.
Some of the Metrics for this context output information such as capacity or number of files (for NameNode), number
of failed disk volumes, remaining capacity on that particular worker node (for DataNode), et cetera. JobTracker and
TaskTracker use the mapred context for their counters. These Metrics contain pre-job counter data, job counters, and post-
job counters. The rpc context is used for remote procedure call (RPC) Metrics such as average time taken to process an RPC,
number of open connections, and the like, and is applicable to all Hadoop daemons. Table 7-1 summarizes the contexts.
Table 7-1. Contexts for Hadoop Metrics
Context | Description | Applicable to | Example |
jvm | Basic statistics for JVM (Java Virtual Machine) | All Hadoop daemons | Memory usage, thread count |
dfs | Distributed file system | NameNode, DataNode | Capacity, failed disk volumes |
mapred | MapReduce | JobTracker, TaskTracker | Job counters |
rpc | Remote procedure calls | All Hadoop daemons | Number of open connections |
Early versions of Hadoop managed Metrics through a system named Metrics, while the current version of
Hadoop uses Metrics2. The management systems have two major differences. Metrics relies on a one-to-one
relationship of one context per plug-in, while Metrics2 enables you to output Metrics to multiple plug-ins. The
Metrics2 system also uses a slightly different terminology; the Metrics data output by Hadoop daemons is referred to
as sources and the plug-ins are called sinks. Sources produce the data, and sinks consume or output the data. Let me
discuss a few Metrics for each of the contexts.
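For example, a minimal hadoop-metrics2.properties sketch that routes the NameNode and DataNode sources to the standard FileSink (the output file names are placeholders):
# Every configured daemon writes to the sink named "file" via the standard FileSink class
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
# Per-daemon output files
namenode.sink.file.filename=namenode-metrics.out
datanode.sink.file.filename=datanode-metrics.out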
The jvm Context
The jvm Metrics focus on basic JVM statistics. Table 7-2 lists some of these Metrics.
Table 7-2. Metrics for jvm Context
Metric | Description |
GcCount | Number of garbage collections (automated deallocation of heap memory from unused objects) performed for the JVM |
GcTimeMillis | Total time for all garbage collections for a JVM (in milliseconds) |
LogFatal | Number of log lines with error level FATAL (using Log4j) |
MemHeapCommittedM | Heap memory committed, or the amount of memory guaranteed to be available for use by the JVM (in MB) |
MemHeapUsedM | Heap memory currently used by the JVM (includes memory occupied by all objects) (in MB) |
ThreadsWaiting | Number of threads in WAITING state (i.e., waiting for another thread to complete an action) |
You can infer how dynamic your JVM process is by looking at GcCount and GcTimeMillis Metrics; larger
numbers indicate a lot of memory-based activity. A large number of fatal errors indicate a problem with your system
or application, and you need to consult your logs immediately. The memory counter MemHeapUsedM tells you
about total memory usage, and if you see a large number for ThreadsWaiting, you know you need more memory.
The dfs Context
The dfs (distributed file system) Metrics focus on basic file operations (create, delete) or capacity, transactions, and
the like. Table 7-3 lists some of these Metrics.
Table 7-3. Metrics for dfs Context
Metric | Description |
CapacityRemaining | Total disk space free in HDFS (in GB) |
FilesCreated | Number of files created in a cluster |
FilesDeleted | Number of files deleted in a cluster |
FilesRenamed | Number of files renamed in a cluster |
PercentRemaining | Percentage of remaining HDFS capacity |
TotalBlocks | Total number of blocks in a cluster |
Transactions_avg_time | Average time for a transaction |
Transactions_num_ops | Number of transactions |
The dfs Metrics can be used for security purposes. You can use them to spot unusual activity or sudden change
in activity for your cluster. You can store the daily Metric values (in a Hive table) and calculate an average for the last
30 days. Then, if the daily value for a Metric varies by, say, 50% from the average, you can generate an alert. You can
also direct the Metrics output to Ganglia, use Ganglia for aggregation and averaging, and then use Nagios to generate
alerts based on the 50% variation threshold.
The rpc Context
The rpc (remote procedure call) Metrics focus on process details of remote processes. Table 7-4 lists some important
rpc Metrics.
Table 7-4. Metrics for rpc Context
Metric | Description |
RpcProcessingTimeNumOps | Number of processed RPC requests |
RpcAuthenticationFailures | Number of failed RPC authentication calls |
RpcAuthorizationFailures | Number of failed RPC authorization calls |
The rpc Metrics can also be used for security purposes. You can use them to spot unusual RPC activity or
sudden changes in RPC activity for your cluster. Again, you can store the daily Metric values in a Hive table (or use
Ganglia) and maintain averages over the last 30 days. Then, if the daily value for a Metric varies by a certain percentage
from the average, such as 50%, you can generate an alert. Metrics such as RpcAuthenticationFailures and
RpcAuthorizationFailures are especially important from the security perspective.
The mapred Context
Metrics for mapred (MapReduce) context provide job-related details (for JobTracker/TaskTracker). Table 7-5 lists
some important mapred Metrics.
Table 7-5. Metrics for mapred Context
Metric | Description |
jobs_completed | Number of jobs that completed successfully |
jobs_failed | Number of jobs that failed |
maps_completed | Number of maps that completed successfully |
maps_failed | Number of maps that failed |
memNonHeapCommittedM | Non-heap memory that is committed (in MB) |
memNonHeapUsedM | Non-heap memory that is used (in MB) |
occupied_map_slots | Number of used map slots |
map_slots | Number of map slots |
occupied_reduce_slots | Number of used reduce slots |
reduce_slots | Number of reduce slots |
reduces_completed | Number of reducers that completed successfully |
reduces_failed | Number of reducers that failed |
running_1440 | Number of long-running jobs (more than 24 hours) |
Trackers | Number of TaskTrackers available for the cluster |
Metrics for mapred context provide valuable information about the jobs that were executed on your cluster.
They can help you determine if your cluster has any performance issues (from a job execution perspective). You can
use a monitoring system (like Ganglia) to make sure that you have enough map and reduce slots available at any time.
Also, you can make sure that you don’t have any long-running jobs—unless you know about them in advance! You
can use Nagios with Ganglia to generate appropriate alerts. Just like the other contexts, mapred Metrics can also be
monitored for unusual job activity (against average job activity).
You can find Hadoop Metrics listed in Appendix D, “Hadoop Metrics and Their Relevance to Security.” Appendix D
also includes an example that explains the use of specific Metrics and pattern searches for security (I included the
security-specific configuration for that example, too).
Metrics and Security
Several Metrics can provide useful security information, including the following:
Activity statistics for NameNode: It’s important to monitor the activity on NameNode, as
it can provide a lot of information that can alert you to security issues. Being the “brain” of
a Hadoop cluster, NameNode is the hub of all file creation activity. If the number of newly
created files changes drastically, or the number of files whose permissions are changed
increases drastically, the Metrics can trigger alerts so you can investigate.
Activity statistics for a DataNode: For a DataNode, if the number of reads or writes by a local
client increases suddenly, you definitely need to investigate. Also, if the number of blocks
added or removed changes by a large percentage, then Metrics can trigger alerts to warn you.
Activity statistics for RPC-related processing: For the NameNode (or a DataNode), you need
to closely monitor the rpc Metrics, such as the number of processed RPC requests, the number of
failed RPC authentication calls, or the number of failed RPC authorization calls. You can compare
the daily numbers with weekly averages and generate alerts if the numbers differ by a threshold
percentage. For example, suppose the number of failed RPC authorization calls for a day is 50,
the weekly average is 30, and the alert threshold is a 50% rise over the weekly average: the
threshold value is 30 + 15 = 45, and since the daily number (50) exceeds 45, an alert is generated.
Activity statistics for sudden changes in system resources: It is beneficial to monitor for
sudden changes in any of the major system resources, such as available memory, CPU, or
storage. Hadoop provides Metrics for monitoring these resources, and you can either define
a specific percentage (for generating alerts) or monitor for a percent deviation from weekly or
monthly averages. The latter method is more precise, as some clusters may never hit a fixed
alert percentage even during a malicious attack. For example, if average memory usage for a cluster
is 20% and a malicious attack causes usage to jump to 60%, an alert threshold fixed at 80% or 90%
will never fire; an alert threshold defined as 50% or more above average usage definitely will.
You can use a combination of Ganglia and Nagios to monitor sudden changes to any of your system resources or
Metrics values for any of the Hadoop daemons. Again, Appendix D has an example that describes this approach.
If you don’t want to use a monitoring system and prefer the “old-fashioned” approach of writing the
Metrics data to files and using Hive or HBase to load that data into tables, that will work, too. You will, of course, need to
develop shell scripts for scheduling your data loads, performing aggregations, generating summary reports, and generating
appropriate alerts.
Metrics Filtering
When you are troubleshooting a security breach or a possible performance issue, reviewing a large amount of Metrics
data can take time and be distracting and error-prone. Filtering the Metrics data helps you focus on possible issues
and save valuable time. Hadoop allows you to configure Metrics filters by source, context, record, and Metrics.
The highest level for filtering is by source (e.g., DataNode5) and the lowest level of filtering is by the Metric name
(e.g., FilesCreated). Filters can be combined to optimize the filtering efficiency.
For example, the following file sink accepts Metrics from context dfs only:
bcl.sink.file0.class=org.apache.hadoop.metrics2.sink.FileSink
bcl.sink.file0.context=dfs
To set up your filters, you first need to add a snippet like the following in your
$HADOOP_INSTALL/hadoop/conf/hadoop-metrics2.properties file:
# Syntax: <prefix>.(source|sink).<instance>.<option>
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
*.record.filter.class=${*.source.filter.class}
*.metric.filter.class=${*.source.filter.class}
After this, you can include any of the following configuration options that will set up filters at various levels:
# This will filter out sources with names starting with Cluster2
jobtracker.*.source.filter.exclude=Cluster2*
# This will filter out records with names that match localhost in the source dfs
jobtracker.source.dfs.record.filter.exclude=localhost*
# This will filter out Metrics with names that match cpu* for sink instance file only
jobtracker.sink.file.metric.filter.exclude=cpu*
jobtracker.sink.file.filename=MyJT-metrics.out
To summarize, you can filter out Metric data by source, by a pattern within a source, or by Metric names
or patterns within an output file for a sink.
Please remember that when you specify an “include” pattern only, the filter includes only data that matches the filter
condition. When you specify an “exclude” pattern only, the matched data is excluded. Most important, when you
specify both patterns, sources that match neither pattern are included as well! Last, include patterns have
precedence over exclude patterns. A toy model of these rules follows.
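Here is a small Python model of those rules (a sketch for intuition only, not Hadoop code), using shell-style globs as the GlobFilter does:

from fnmatch import fnmatch

def accepted(name, includes, excludes):
    """Toy model of the documented Metrics2 filter semantics."""
    if any(fnmatch(name, p) for p in includes):
        return True                    # include patterns take precedence
    if any(fnmatch(name, p) for p in excludes):
        return False                   # explicitly excluded
    return True                        # matches neither pattern: included

print(accepted("Cluster2Source", [], ["Cluster2*"]))  # False: excluded
print(accepted("cpu_user", ["cpu*"], ["c*"]))         # True: include wins
print(accepted("FilesCreated", ["cpu*"], ["c*"]))     # True: matches neither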
Capturing Metrics Output to File
How do you direct output of NameNode or DataNode Metrics to files? With Metrics2, you can define a sink (output file)
into which to direct output from your Metric source by adding a few lines to the hadoop-metrics2.properties
configuration file in the directory /etc/hadoop/conf or $HADOOP_INSTALL/hadoop/conf. In the following example,
I am redirecting the NameNode and DataNode Metrics to separate output files as well as to the Ganglia monitoring
system (remember, Metrics2 can output to multiple sinks at once):
# Following are entries from configuration file hadoop-metrics2.properties
# collectively they output Metrics from sources NameNode and DataNode to
# a sink named 'tfile' (output to file) and also to a sink named 'ganglia'
# (output to Ganglia)
# Defining sink for file output
*.sink.tfile.class=org.apache.hadoop.metrics2.sink.FileSink
# Filename for NameNode output
namenode.sink.tfile.filename = namenode-metrics.log
# Output the DataNode Metrics to a separate file
datanode.sink.tfile.filename = datanode-metrics.log
# Defining sink for Ganglia 3.1
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
# Default polling period for GangliaSink
*.sink.ganglia.period=10
# Directing output to ganglia servers
namenode.sink.ganglia.servers=gangliahost_1:8649,gangliahost_2:8649
datanode.sink.ganglia.servers=gangliahost_1:8649,gangliahost_2:8649
Now that you have all the Metric data in files, you need to make effective use of it. If you don’t plan to use a
monitoring system, you will have to define file sinks (as output) for all the Hadoop daemons and manually analyze
the huge output files or aggregate them as required! At most, you can define Hive external tables to ease the
processing. Alternatively, you can direct the Metrics output to a JMX console for review.
Please note that with either of these approaches, you won’t be able to display the Metric data or aggregations
graphically for a quick review. You will also need to set up an interface to an alerting mechanism via shell scripts
(accessing the Hive data), as well as interfaces for paging the system administrators in case of critical events.
However, if you plan to use Ganglia, sending your Metrics to the Ganglia monitoring system is as simple as
sending them to a file and provides many more advantages, as you’ll learn in the next section.
Security Monitoring with Ganglia and Nagios
The best security monitoring system for your Hadoop cluster is a system that matches your environment and needs.
In some cases, making sure that only authorized users have access may be most important, while in other cases, you
may need to monitor the system resources and raise an immediate alert if a sudden change in their usage occurs.
Some cluster administrators solely want to monitor failed authentication requests. The leaders in Hadoop security
monitoring, Ganglia (http://ganglia.sourceforge.net) and Nagios (www.nagios.org), meet this challenge by
providing flexibility and varied means of monitoring the system resources, connections, and any other part of your
Hadoop cluster that’s technically possible to monitor.
Both are open source tools with different strengths that complement each other nicely. Ganglia is very good at
gathering Metrics, tracking them over time, and aggregating the results; while Nagios focuses more on providing an
alerting mechanism. Since gathering Metrics and alerting are both equally essential aspects of monitoring, Ganglia
and Nagios work best together. Both these tools have agents running on all hosts for a cluster and gather information
via a polling process that can poll any of the hosts to get the necessary information.
Ganglia
Ganglia was designed at the University of California, Berkeley and started as an open source monitoring project
meant to be used with large distributed systems. Ganglia’s open architecture makes it easy to integrate with other
applications and gather statistics about their operations. That’s the reason Ganglia can receive and process output
data from Hadoop Metrics with ease and use it effectively.
For a monitored cluster, each host runs a daemon process called gmond that collects and broadcasts the local
Metrics data (like CPU usage, memory usage, etc.) to all the hosts within the cluster. A polling process (gmetad) can
then query any of the hosts, read all the Metrics data and route it to a central monitoring server. The central host can
display the Metrics, aggregate them, or summarize them for further use. Gmond has little overhead and hence can
easily be run on every machine in the cluster without affecting user performance. Ganglia’s web interface can easily
display the summary usage for the last hour, day, week, or month as you need. Also, you can get details of any of these
resource usages as necessary.
Ganglia Architecture

Broadly, Ganglia has four major components: gmond, gmetad, rrdtool and gweb. gmond runs on all the nodes in a
cluster and gathers Metrics data, gmetad polls the data from gmond, rrdtool stores the polled data, and gweb is the
interface that provides visualization and analysis for the stored data. Figure 7-3 illustrates how Ganglia’s components
fit into the basic Hadoop distributed monitoring system shown in Figure 7-2.
Figure 7-3. Ganglia monitoring system for Hadoop (gmond runs on every node, and the nodes transmit and receive monitoring data from each other; gmetad, backed by RRDtool, polls from a single node within a cluster and can run on any node within the cluster or off it; gweb is the graphical user interface that displays the consolidated monitoring output)
Take a closer look at what each of the Ganglia components does:
gmond: gmond needs to be installed on every host you want monitored. It interacts with
the host operating system to acquire Metrics such as load Metrics (e.g., average cluster load),
process Metrics (e.g., total running processes) or rpc Metrics (e.g., RpcAuthenticationFailures).
It is modular and uses operating system–specific plugins to take measurements. Since only
the necessary plugins are installed at compile time, gmond has a very small footprint and
negligible overhead.
gmond is not invoked on request from an external polling engine; rather, it takes measurements
according to a schedule defined by a local configuration file. Measurements are
shared with the other hosts from the cluster via a simple listen/announce protocol broadcast
on a common multicast address. Every gmond host also records the Metrics it receives from
other hosts within the cluster.
Therefore, every host in a Ganglia cluster knows the current value of every Metric recorded
by every other host in the same cluster. That’s the reason only one host per cluster needs
to be polled to get the Metrics of the entire cluster, and an individual host failure won’t affect
the system at all! This design also drastically reduces the number of hosts that need to be polled,
and hence scales easily to large clusters.
gmetad: gmetad is the polling process within the Ganglia monitoring system. It needs a list of
hostnames that specifies at least one host per cluster. gmetad requests an XML-format dump of
the Metrics for a cluster from any host in the cluster on port 8649; that is how gmetad gets the
Metrics data for a cluster.
RRDtool: RRDtool is the Ganglia component used for storing the Metrics data polled by
gmetad from any of the cluster hosts. Metrics are stored in “round-robin” fashion; when
no space remains to store new values, old values are overwritten (see the short sketch after
this list). As per the specified data retention requirements, RRDtool aggregates the data values
or “rolls them up.” This way of storing data allows us to quickly analyze recent data as well as
maintain years of historical data using a small amount of disk space. Also, since all the required
disk space is allocated in advance, capacity planning is very easy.
gweb: gweb is the visualization interface for Ganglia. It provides instant access to any
Metric from any host in the cluster without specifying any configuration details. It visually
summarizes the entire grid using graphs that combine Metrics by cluster and provides
drop-downs for additional details. If you need details of a specific host or Metric, you can
specify the details and create a custom graph of exactly what you want to see.
gweb allows you to change the time period in graphs, supports extracting data in various
textual formats (CSV, JSON, and more), and provides a fully functional URL interface so that
you can embed necessary graphs into other programs via specific URLs. Also, gweb is a PHP
program, which is run under the Apache web server and is usually installed on the same
physical hardware as gmetad, since it needs access to the RRD databases created by gmetad.
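To picture the round-robin storage that RRDtool uses, here is a toy Python illustration (it assumes nothing about RRDtool's actual file format; a fixed-size buffer simply overwrites the oldest sample once it is full):

from collections import deque

rra = deque(maxlen=5)              # keep only the 5 most recent samples
for sample in [10, 12, 11, 14, 13, 99, 98]:
    rra.append(sample)             # the 6th and 7th samples overwrite the oldest

print(list(rra))                   # [11, 14, 13, 99, 98]
print(sum(rra) / len(rra))         # a consolidated (averaged) value, as RRDtool rolls up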
Configuring and Using Ganglia
With a clearer understanding of Ganglia’s major components, you’re ready to set it up and put it to work for
security-related monitoring and outputting specific Hadoop Metrics.
To install Ganglia on a Hadoop cluster you wish to monitor, perform the following steps:
1. Install the Ganglia components gmetad, gmond, and gweb on one of the cluster nodes or
hosts. (For my example, I called the host GMaster.)
2. Install the Ganglia component gmond on all the other cluster nodes.
The exact command syntax or means of install will vary according to the operating system you use. Please refer to
the Ganglia installation instructions for specifics. In all cases, however, you will need to modify configuration files for
Ganglia to work correctly and also for Hadoop to output Metrics through Ganglia as expected (the configuration files
gmond.conf, gmetad.conf, and hadoop-metrics2.properties need to be modified).
To begin, copy gmond.conf (with the following configuration) to all the cluster nodes:
/* the values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
daemonize = yes
setuid = yes
user = nobody
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
allow_extra_data = yes
host_dmax = 86400 /* secs. Expires hosts in 1 day */
host_tmax = 20 /* secs */
cleanup_threshold = 300 /* secs */
gexec = no
send_metadata_interval = 0 /* secs */
}
/*
The cluster attributes specified will be used as part of the <CLUSTER>
tag that will wrap all hosts collected by this instance.
*/
cluster {
name = "pract_hdp_sec"
owner = "Apress"
latlong = "N43.47 E112.34"
url = "http://www.apress.com/9781430265443"
}
/* The host section describes attributes of the host, like the location */
host {
location = "Chicago"
}
/* Feel free to specify as many udp_send_channels as you like */
udp_send_channel {
bind_hostname = yes # soon to be default
mcast_join = 239.2.11.71
port = 8649
ttl = 1
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
retry_bind = true
}
/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
}
/* Each Metrics module that is referenced by gmond must be specified and
loaded. If the module has been statically linked with gmond, it does
not require a load path. However all dynamically loadable modules must
include a load path. */
modules {
module {name = "core_metrics"}
module {name = "cpu_module" path = "modcpu.so"}
module {name = "disk_module" path = "moddisk.so"}
module {name = "load_module" path = "modload.so"}
module {name = "mem_module" path = "modmem.so"}
module {name = "net_module" path = "modnet.so"}
module {name = "proc_module" path = "modproc.so"}
module {name = "sys_module" path = "modsys.so"}
}
In the globals section, the daemonize attribute, when true, makes gmond run as a background process.
A debug_level greater than 0 results in gmond running in the foreground and outputting debugging information.
The mute attribute, when true, prevents gmond from sending any data, and the deaf attribute, when true,
prevents gmond from receiving any data. If host_dmax is set to a positive number, gmond will flush a host after it
has not heard from it for host_dmax seconds. The cleanup_threshold is the minimum amount of time before gmond
will clean up any hosts or Metrics with expired data. Setting send_metadata_interval to 0 means that gmond will
send the metadata packets only at startup and upon request from other gmond nodes running remotely.
Several Ganglia Metrics detect sudden changes in system resources and are well suited for security monitoring:
cpu_aidle (percentage of CPU cycles idle since last boot; valid for Linux)
cpu_user (percentage of CPU cycles spent executing user processes)
load_five (reported system load, averaged over five minutes)
mem_shared (amount of memory occupied by system and user processes)
proc_run (total number of running processes)
mem_free (amount of memory free)
disk_free (total free disk space)
bytes_in (number of bytes read from all non-loopback interfaces)
bytes_out (number of bytes written to all non-loopback interfaces)
You can add them to your gmond.conf file in the following format:
collection_group {
collect_every = 40
time_threshold = 300
metric {
name = "bytes_out"
value_threshold = 4096
title = "Bytes Sent"
}
}
As you can see in the example, Metrics that need to be collected and sent out at the same interval can be grouped
under the same collection_group. In this example, collect_every specifies the sampling interval (in seconds),
time_threshold specifies the maximum data send interval (i.e., data is sent out at least that often), and
value_threshold is the Metric variance threshold (i.e., the value is sent whenever it changes by more than
value_threshold).
The second configuration file is gmetad.conf, which needs to reside on the host (GMaster) only. Keep in mind
that the code that follows is only an example, and you can set up your own data sources or change settings as you
need for round-robin archives:
# Format:
# data_source "my cluster" [polling interval] address1:port address2:port ...
#
data_source "HDPTaskTracker" 50 localhost:8658
data_source "HDPDataNode" 50 localhost:8659
data_source "HDPNameNode" 50 localhost:8661
data_source "HDPJobTracker" 50 localhost:8662
data_source "HDPResourceManager" 50 localhost:8664
data_source "HDPHistoryServer" 50 localhost:8666
#
# Round-Robin Archives
# You can specify custom Round-Robin archives here
#
RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" RRA:AVERAGE:0.5:168:244"
"RRA:AVERAGE:0.5:672:244" "RRA:AVERAGE:0.5:5760:374"
#
# The name of this Grid. All the data sources above will be wrapped in a GRID
# tag with this name.
# default: unspecified
gridname "HDP_GRID"
#
# In earlier versions of gmetad, hostnames were handled in a case
# sensitive manner. If your hostname directories have been renamed to lower
# case, set this option to 0 to disable backward compatibility.
# From version 3.2, backwards compatibility will be disabled by default.
# default: 1 (for gmetad < 3.2)
# default: 0 (for gmetad >= 3.2)
case_sensitive_hostnames 1
Last, you need to customize the hadoop-metrics2.properties configuration file in the directory /etc/hadoop/conf
or $HADOOP_INSTALL/hadoop/conf. You can define appropriate sources (in this case, the dfs, jvm, rpc, or
mapred Metrics), sinks (just Ganglia or a combination of Ganglia and output files), and filters (to filter out Metrics
data that you don’t need).
To set up your sources and sinks, use code similar to the following:
# syntax: [prefix].[source|sink|jmx].[instance].[options]
# See package.html for org.apache.hadoop.metrics2 for details
*.period=60
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
# default for supportsparse is false
*.sink.ganglia.supportsparse=true
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
# Associate sinks with server and ports
namenode.sink.ganglia.servers=localhost:8661
datanode.sink.ganglia.servers=localhost:8659
jobtracker.sink.ganglia.servers=localhost:8662
tasktracker.sink.ganglia.servers=localhost:8658
maptask.sink.ganglia.servers=localhost:8660
reducetask.sink.ganglia.servers=localhost:8660
resourcemanager.sink.ganglia.servers=localhost:8664
nodemanager.sink.ganglia.servers=localhost:8657
historyserver.sink.ganglia.servers=localhost:8666
resourcemanager.sink.ganglia.tagsForPrefix.yarn=Queue
Setting supportsparse to true helps reduce bandwidth usage; otherwise the Metrics cache is updated
every time a Metric is published, which can be CPU/network intensive. The Ganglia slope can have values of zero
(the Metric value always remains the same), positive (the Metric value can only increase), negative
(the Metric value can only decrease), or both (the Metric value can either increase or decrease). The dmax
value indicates how long a particular value will be retained. For example, the value for the JVM Metric threadsBlocked
(from the preceding configuration) will be retained for 70 seconds only.
As I discussed earlier in the “Metrics Filtering” section, filters are useful in situations where you are
troubleshooting or need to focus on a known issue and need specific Metric data only. Of course, you can limit the
Metrics data you are capturing through settings in gmond.conf (as you learned earlier in this section), but filters are
handy when you need Metric data limited (or captured) temporarily and quickly!
Monitoring HBase Using Ganglia
Ganglia can be used to monitor HBase just as you have seen it used for monitoring Hadoop. There is a configuration
file called hadoop-metrics.properties located in directory $HBASE_HOME/conf (where $HBASE_HOME is the HBase
install directory). You need to configure all the “contexts” for HBase to use Ganglia as an output:
# Configuration of the "hbase" context for Ganglia
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
hbase.period=60
hbase.servers=localhost:8649
# Configuration of the "jvm" context for Ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=60
hbase.servers=localhost:8649
# Configuration of the "rpc" context for Ganglia
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rpc.period=60
hbase.servers=localhost:8649
For the hbase context, you can see values for metrics like averageLoad (average number of regions served by each
region server) or numRegionServers (number of online region servers) on the HBase master server.
Also, for the jvm context, you can see Metrics like MemHeapUsedM (heap memory used, in MB) and
MemHeapCommittedM (heap memory committed, in MB). If more than one JVM is running (i.e., more than one
HBase process), Ganglia aggregates the Metrics values instead of reporting them per instance.
This concludes the HBase monitoring section. I have listed all the HBase Metrics in Appendix D for your reference.
Before I conclude the discussion about Ganglia, I want you to have a quick look at the Ganglia web interface.
Please review Figure 7-4. It shows the Ganglia dashboard displaying summary graphs for the previous month.
You can see the average and maximum load, CPU usage, memory usage, and network usage. From the dashboard you
can select detailed graphs for any of these resources or create custom graphs for the specific Metrics you need.
Figure 7-4. Ganglia dashboard
■ Note Ganglia is available at http://ganglia.sourceforge.net/. Plug-ins for Ganglia are available at
https://github.com/ganglia/. The user community URL for Ganglia is: http://ganglia.info/?page_id=67.
Nagios
Nagios is a specialized scheduling and notification engine. It doesn’t monitor any processes or resources, but instead
schedules execution of plug-ins (executable programs—Nagios plug-ins are not the same as Hadoop Metrics plug-ins)
and takes action based on execution status. For example, status 0 is Success, 1 is Warning, 2 is Critical, and 3 is
Unknown. You can configure the Nagios service to map specific actions for each of these outputs for all the plug-ins
defined within the configuration files. In addition, you can define your own plug-ins and define the frequency for
monitoring them as well as actions mapped to each of the possible outputs.
In addition to codes, the plug-ins can also return a text message, which can be written to a log and also be
displayed on the web interface. If the text message contains a pipe character, the text after it is treated as performance
data. The performance data contains Metrics from the monitored hosts and can be passed to external systems
(like Ganglia) for use.
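As an illustration of those conventions, here is a minimal custom plug-in sketch in Python (the log path, thresholds, and service name are illustrative assumptions, not a standard Nagios plug-in):

#!/usr/bin/env python
import subprocess, sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # Nagios exit statuses

# Count ssh failed-login lines; the log path is an assumption for Linux hosts
proc = subprocess.run(["grep", "-c", "Failed password", "/var/log/secure"],
                      capture_output=True, text=True)
if proc.returncode == 2:                      # grep itself failed (e.g., unreadable file)
    print("SSH FAILLOGIN UNKNOWN - could not read log")
    sys.exit(UNKNOWN)

failures = int(proc.stdout.strip() or 0)
if failures > 100:
    status, label = CRITICAL, "CRITICAL"
elif failures > 20:
    status, label = WARNING, "WARNING"
else:
    status, label = OK, "OK"

# Text after the pipe character is treated as performance data
print(f"SSH FAILLOGIN {label} - {failures} failed logins | failed_logins={failures}")
sys.exit(status)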
Most of the time, Nagios is used for monitoring along with Ganglia. The reason is that both these open source
tools complement each other nicely, since they have different strengths. For example, Ganglia is more focused on
gathering Metrics and tracking them over a time period, while Nagios focuses more on being an alerting mechanism.
Since gathering Metrics and alerting are both essential aspects of monitoring, they work best in conjunction. Both
Ganglia and Nagios have agents running on all hosts for a cluster and gather information.
Getting back to Nagios, let me start with Nagios architecture.
Architecture
The Nagios daemon or service runs on a host and has plug-ins running on all the remote hosts that need to be
monitored. (To integrate Nagios with Ganglia, be sure the Ganglia process gmond is running on every host that has
a Nagios plug-in running). The remote Nagios plug-ins send information and updates to the Nagios service, and the
Nagios web interface displays it. When issues are detected, the Nagios daemon notifies predefined administrative
contacts using email or page (text message sent to a phone). Historical log data is available in a log file defined in the
configuration file. As you can see in Figure 7-5, the Nagios monitoring system has three major components:
Server: The server is responsible for managing and scheduling plug-ins. At regular intervals,
the server checks the plug-in status and performs action as per the status. In case of alerts,
configured administrative resources are notified.
Plug-ins: Nagios provides a standard set of user-configurable plug-ins, plus you can add more
as required. Plug-ins are executable programs (mostly written in C, Java, Python, etc.) that
perform a specific task and return a result to the Nagios server.
Browser interface of Nagios: These are web pages generated by CGI that display summary
information about monitored resources.
Figure 7-5. Nagios architecture (the Nagios server schedules checks for monitored hosts through local or remote plug-ins, such as CPU Metrics on one node, disk I/O on another, and memory Metrics on a third; it processes the results, sends alerts, and drives a graphical user interface that displays the status of monitored resources)
■ Note Nagios is freely available at http://www.nagios.org. You can download official Nagios plug-ins from the
Nagios Plug-In Development Team at http://nagiosplug.sourceforge.net. In addition, the Nagios community is
continuously developing new plug-ins, which you can find at http://exchange.nagios.org.
Although using Ganglia and Nagios in conjunction is an effective approach to security monitoring, the applications
are not integrated by default. You need to integrate them through plug-ins, as the next section explains.
Nagios Integration with Ganglia
Nagios has no built-in Metrics. Remote or local plug-ins are executed and their status compared by Nagios with
user-specified status/notification mapping to perform any necessary notification tasks. Services like NRPE
(Nagios Remote Plugin Executor) or NSCA (Nagios Service Check Acceptor) are used for remote executions. If you’re
using Ganglia for monitoring, however, all the Metrics Nagios needs (for CPU, memory, disk I/O, etc.) are already
available. You simply have to point Nagios at Ganglia to collect these Metrics! To help you, as of version 2.2.0
the Ganglia project started including a number of official Nagios plug-ins in its gweb versions (for details, see
https://github.com/ganglia/ganglia-web/wiki/Nagios-Integration). In Nagios, you can then use these
plug-ins to create commands and services to compare Metrics captured (or generated) by Ganglia against alert
thresholds defined in Nagios.
Originally, five Ganglia plug-ins were available:
check_heartbeat (check heartbeat to verify if the host is available)
check_metric (check a single Metric on a specific host)
check_multiple_metrics (check multiple Metrics on a specific host)
check_host_regex (check multiple Metrics across a regex-defined range of hosts)
check_value_same_everywhere (check value or values are the same across a set of hosts)
Now, the current Ganglia web tarball (version 3.6.2) contains 10 plug-ins for Nagios integration! You can download
it at http://sourceforge.net/projects/ganglia/files/ganglia-web/3.6.2/ to check out the five new plug-ins.
Using Ganglia’s Nagios Plug-ins
When extracted, the Ganglia web tarball contains a subdirectory called nagios that contains the shell scripts as well
as the PHP scripts for each of the plug-ins. The shell script for a plug-in accepts values for parameters and passes them
on to the corresponding PHP script. The PHP script processes the values and uses an XML dump of the grid state
(the state of the cluster, containing details of all the Metrics, obtained by gmetad) to acquire current Metric values as per
the request. A return code (indicating the status of the request) is passed back to Nagios. Figure 7-6 illustrates the process.
Figure 7-6. Ganglia-Nagios integration processing (the Nagios server runs the shell script for a plug-in with the specified input parameter values; the shell script passes them to the plug-in’s PHP script, which the gweb server runs; gmetad provides the actual Metrics through an XML dump of the grid state cache)
Remember to enable the server-side PHP script functionality before using it and to verify the following parameter
values in configuration file conf.php (used by gweb):
$conf['nagios_cache_enabled'] = 1;
$conf['nagios_cache_file'] = $conf['conf_dir']."/nagios_ganglia.cache";
$conf['nagios_cache_time'] = 45;
The location of conf.php varies as per the operating system, Hadoop distribution, and other factors. Your best
option is to use the find command:
find / -name conf.php -print
The steps to follow for using Nagios as a scheduling and alerting mechanism for any of the five Ganglia plug-ins
are very similar. Therefore, I will demonstrate the process with two of the plug-ins: check_heartbeat and
check_multiple_metrics. I also will assume you have installed Ganglia, PHP, and Nagios and you are using the
Hortonworks Hadoop distribution.
The check_heartbeat plug-in is a heartbeat counter used by Ganglia to make sure a host is functioning
normally. This counter is reset every time a new Metric packet is received for the host. To use this plug-in with
Nagios, first copy the check_heartbeat.sh script from the Nagios subdirectory in the Ganglia web tarball (in my
case, /var/www/html/ganglia/nagios) to your Nagios plug-ins directory (in my case, /usr/lib64/nagios/plugins).
Make sure that the GANGLIA_URL inside the script is correct. Substitute your localhost name and check if
http://localhost/ganglia takes you to the Ganglia homepage for your installation. Then check if this is the
setting in check_heartbeat.sh:
GANGLIA_URL=http://<localhost>/ganglia/nagios/check_heartbeat.php
At this point, you might also want to verify that the PHP command-line installation on your Nagios server is functional;
you can do that by running the php --version command. You should see a response similar to the following:
PHP 5.3.3 (cli) (built: Aug 6 2014 05:54:27)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies
Run the plug-in script and verify it provides the heartbeat status correctly:
./check_heartbeat.sh host=pract_hdp_sec threshold=75
OK Last beacon received 0 days, 0:00:07
Next, define this plug-in as a command for Nagios (see the sidebar “Nagios Commands and Macros” for details).
The threshold is the amount of time since the last reported heartbeat; that is, if the last packet received was 50 seconds
ago, you would specify 50 as the threshold:
define command {
command_name check_ganglia_heartbeat
command_line $USER1$/check_heartbeat.sh host=$HOSTADDRESS$ threshold=$ARG1$
}
Note the use of the macros $HOSTADDRESS$ (substituted with the IP address of the host), $USER1$ (a user-defined macro
defined in a resource file), and $ARG1$ (the first argument to the command). Using macros automatically provides the
information contained in them to a command (since the referenced value is available). So, the command
check_ganglia_heartbeat can be used for checking the heartbeat on any host within your cluster. Similarly, the
argument value passed to this command lets you change that parameter at runtime. Please refer to the sidebar
“Nagios Commands and Macros” for further details about macros.
NAGIOS COMMANDS AND MACROS
For Nagios, a command can be defined to include service checks, service notifications, service event handlers,
host checks, host notifications, and host event handlers. Command definitions can contain macros that are
substituted at runtime; this is one of the main features that makes Nagios flexible (please refer to
http://nagios.sourceforge.net/docs/3_0/macros.html for more information on macros).
Macros can provide information from hosts, services, and other sources. For example, $HOSTNAME$ and
$HOSTADDRESS$ are frequently used macros. Macros can also pass arguments using $ARGn$ (the nth argument
passed to a command). Nagios supports up to 32 argument macros ($ARG1$ through $ARG32$). The syntax for
defining a command is as follows:
define command{
command_name <command_name>
command_line <command_line>
}
where <command_name> is the name of the command and <command_line> is what Nagios actually executes
when the command is used.
You can define the commands in the Nagios main configuration file called nagios.cfg. Most of the time the file
resides in /etc/nagios, but location may vary for your install. The main configuration file defines individual object
configuration files for commands, services, contacts, templates, and so forth. In addition, there may be a specific
section for Hadoop servers. For example, the Hortonworks nagios.cfg has the following section:
# Definitions for hadoop servers
cfg_file=/etc/nagios/objects/hadoop-hosts.cfg
cfg_file=/etc/nagios/objects/hadoop-hostgroups.cfg
cfg_file=/etc/nagios/objects/hadoop-servicegroups.cfg
cfg_file=/etc/nagios/objects/hadoop-services.cfg
cfg_file=/etc/nagios/objects/hadoop-commands.cfg
I will define the command check_ganglia_heartbeat in the configuration file /etc/nagios/objects/
hadoop-commands.cfg. The last step is defining a service for Nagios. Within Nagios, use of the term service is very
generic or nonspecific. It may indicate an actual service running on the host (e.g., POP, SMTP, HTTP, etc.) or some
other type of Metric associated with the host (free disk space, CPU usage, etc.). A service is defined in the configuration
file /etc/nagios/objects/hadoop-services.cfg and has the following syntax:
define service {
host_name localhost
use hadoop-service
service_description GANGLIA::Ganglia Check Heartbeat
servicegroups GANGLIA
check_command check_ganglia_heartbeat!50
normal_check_interval 0.25
retry_check_interval 0.25
max_check_attempts 4
}
Please note that check_command indicates the actual command that will be executed on the specified host.
The parameter normal_check_interval indicates the number of time units to wait before scheduling the next
check of the service. One time unit is 60 seconds (the default), so 0.25 indicates 15 seconds.
retry_check_interval defines the number of time units to wait before scheduling a recheck of the service if it has
changed to a non-okay state, and max_check_attempts indicates the number of retries in such a situation.
The command check_multiple_metrics checks multiple Ganglia Metrics and generates a single alert. To use it,
copy the check_multiple_metrics.sh script from the nagios subdirectory in the Ganglia web tarball to your Nagios
plug-ins directory. Make sure that GANGLIA_URL inside the script is set to http://localhost/ganglia/nagios/
check_multiple_metrics.php, and also remember to substitute localhost with the appropriate host name.
Define the corresponding command check_ganglia_multiple_metrics in the configuration file
/etc/nagios/objects/hadoop-commands.cfg:
define command {
command_name check_ganglia_multiple_metrics
command_line $USER1$/check_multiple_metrics.sh host=$HOSTADDRESS$ checks='$ARG1$'
}
You can add a list of checks delimited with a colon. Each check consists of Metric_name,operator,critical_value.
Next, define a corresponding service in the configuration file /etc/nagios/objects/hadoop-services.cfg:
define service {
host_name localhost
use hadoop-service
service_description GANGLIA::Ganglia check Multiple Metric service
servicegroups GANGLIA
check_command check_ganglia_multiple_metrics!disk_free,less,10:load_one,more,5
normal_check_interval 0.25
retry_check_interval 0.25
max_check_attempts 4
}
Note the check_command section that defines the command to be executed:
check_ganglia_multiple_metrics!disk_free,less,10:load_one,more,5.
This indicates that an alert will be generated if the free disk space (for the host) falls below 10 GB or if the 1-minute load
average goes over 5.
After successfully defining your Ganglia plug-ins, you can use the Nagios web interface to check and manage
these plug-ins. As you can see in Figure 7-7, the new check_heartbeat and check_multiple_metrics plug-ins are
already in place and being managed by Nagios.
Figure 7-7. Nagios web interface with plug-ins
If you’d like more practice, you can follow the same steps and add the other three plug-ins.
The Nagios Community
The real strength of Nagios is its active user community, which is constantly working toward more effective
use of Nagios and adding plug-ins to enhance its functionality. To see the latest plug-ins your fellow users have
developed, visit the community page at http://exchange.nagios.org/directory/Plugins. For security purposes,
you’ll find many plug-ins that you can use effectively, such as:
check_ssh_faillogin: Monitors the ssh failed login attempts; available at
http://exchange.nagios.org/directory/Plugins/Security/check_ssh_faillogin/details.
show_users: Shows logged users. Can alert on certain users being logged in using a whitelist,
blacklist, or both. Details at: http://exchange.nagios.org/directory/Plugins/
check_long_running_procs.sh: Checks long-running processes; available at
http://exchange.nagios.org/directory/Plugins/System-Metrics/Processes/
Check-long-running-processes/details.
You can use the same process you followed for the Ganglia plug-ins to use any new plug-in: copy it to the
Nagios plug-ins directory, then define a command and a service. Of course, follow any specific
install instructions for individual plug-ins, and install any additional packages required for their functioning.
Summary
In this chapter, I have discussed monitoring for Hadoop as well as popular open source monitoring tools. Remember,
monitoring requires a good understanding of both the resources that need to be monitored and the environment
that you plan to monitor. Though I can tell you what needs to be monitored for a Hadoop cluster, you know your
environment’s individual requirements best. I have tried to provide some general hints, but from my experience,
monitoring is only as good as your system administrator’s knowledge and understanding of your environment.
“Relevance” (how up to date or state of the art a system is) is also a very valid consideration. You have
to be conscious, on a daily basis, of all the innovations in your area of interest (including the malicious attacks) and
tune your monitoring based on them. Remember, the best system administrators are the ones who are most alert and
responsive.
Last, please try to look beyond the specific tools and version numbers to understand the principles and
intentions behind the monitoring techniques described in this chapter. You may not have access to the same tools to
monitor, but if you follow the principles, you will be able to set up effective systems for monitoring; and in the end,
that’s a goal we share.
PART IV
Encryption for Hadoop
CHAPTER 8
Encryption in Hadoop
Recently, I was talking with a friend about possibly using Hadoop to speed up reporting on his company’s “massive”
data warehouse of 4TB. (He heads the IT department of one of the biggest real estate companies in the Chicago
area.) Although he grudgingly agreed to a possible performance benefit, he asked very confidently, “But what about
encrypting our HR [human resources] data? For our MS SQL Server–based HR data, we use symmetric key encryption
and certificates supplemented by C# code. How can you implement that with Hadoop?”
As Hadoop is increasingly used within corporate environments, a lot more people are going to ask the same
question. The answer isn’t straightforward. Most of the Hadoop distributions now have Kerberos installed and/or
implemented and include easy options to implement authorization as well as encryption in transit, but your options
are limited for at-rest encryption for Hadoop, especially with file-level granularity.
Why do you need to encrypt data while it’s at rest and stored on a disk? Encryption is the last line of defense when
a hacker gets complete access to your data. It is a comforting feeling to know that your data is still going to be safe,
since it can’t be decrypted and used without the key that scrambled it. Remember, however, that encryption is used
for countering unauthorized access and hence can’t be replaced by authentication or authorization (both of which
control authorized access).
In this chapter, I will discuss encryption at rest, and how you can implement it within Hadoop. First, I will
provide a brief overview of symmetric (secret key) encryption as used by the DES and AES algorithms, asymmetric
(public key) encryption used by the RSA algorithm, key exchange protocols and certificates, digital signatures, and
cryptographic hash functions. Then, I will explain what needs to be encrypted within Hadoop and how, and discuss
the Intel Hadoop distribution, which is now planned to be offered partially with Cloudera’s distribution and is also
available open source via Project Rhino. Last, I will discuss how to use Amazon Web Services’ Elastic MapReduce (or
VMs preinstalled with Hadoop) for implementing encryption at rest.
Introduction to Data Encryption
Cryptography can be used very effectively to counter many kinds of security threats. Whether you call the data
scrambled, disguised, or encrypted, it cannot be read, modified, or manipulated easily. Luckily, even though
cryptography has its origin in higher mathematics, you do not need to understand its mathematical basis in order
to use it. Simply understand that a common approach is to base the encryption on a key (a unique character pattern
used as the basis for encryption and decryption) and an algorithm (logic used to scramble or descramble data, using
the key as needed). See the “Basic Principles of Encryption” sidebar for more on the building blocks of encryption.
BASIC PRINCIPLES OF ENCRYPTION
As children, my friends and I developed our own special code language to communicate in school. Any messages
that needed to be passed around during class contained number sequences like “4 21 8 0 28 18 24 0 6 18 16
12 17 10” to perplex our teachers if we were caught.
Our code is an example of a simple substitution cipher in which numbers (signifying position within the alphabet)
were substituted for letters and then 3 was added to each number; 0 was used as a word separator. So, the above
sequence simply asked the other guy “are you coming”? While our code was very simple, data encryption in
real-world applications uses complex ciphers that rely on complex logic for substituting the characters. In some
cases, a key, such as a word or mathematical expression, is used to transpose the letters. So, for example, using
“myword” as a key, ABCDEFGHIJKLMNOPQRSTUVWXYZ could map to mywordabcefghijklnpqstuvxz, meaning
the cipher text for the phrase “Hello world” would be “Brggj ujngo”. To add complexity, you can substitute the
position of a letter in the alphabet for x in the expression (2x + 5) mod 26 to map ABCDEFGHIJKLMNOPQRSTUVWXYZ
to gikmoqsuwyabcdefhjlnprtvxz. Complex substitution ciphers can be used for robust security, but a big issue is
the time required to encrypt and decrypt them.
The other method of encryption is transposition (also called reordering, rearranging, or permutation). A
transposition is an encryption where letters of the original text are rearranged to generate the encrypted text. By
spreading the information across the message, transposition makes the message difficult to comprehend. A very
simple example of this type of encryption is columnar transposition, which involves transposing rows of text to
columns. For example, to transpose the phrase “CAN YOU READ THIS NOW” as a six-column transposition, I could
write the characters in rows of six and arrange one row after another:
C A N Y O U
R E A D T H
I S N O W
The resulting cipher text would then be read down the columns as: “cri aes nan ydo otw uh”. Because of the
storage space needed and the delay involved in decrypting the cipher text, this algorithm is not especially
appropriate for long messages when time is of the essence.
Although substitution and transposition ciphers are not used alone for real-world data encryption, their
combination forms a basis for some widely used commercial-grade encryption algorithms.
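Both sidebar techniques are easy to reproduce; here is a short Python demonstration (lowercase output, toy code only) of the keyed substitution with “myword” and the six-column transposition of “CAN YOU READ THIS NOW”:

import string

plain = string.ascii_lowercase
cipher = "mywordabcefghijklnpqstuvxz"        # key letters first, then the rest in order
table = str.maketrans(plain, cipher)
print("hello world".translate(table))         # -> brggj ujngo

text = "CANYOUREADTHISNOW"
cols = 6
rows = [text[i:i + cols] for i in range(0, len(text), cols)]
# read down the columns: "CRI AES NAN YDO OTW UH"
print(" ".join("".join(r[c] for r in rows if c < len(r)) for c in range(cols)))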
Popular Encryption Algorithms
There are two fundamental types of key-based encryption: symmetric and asymmetric. Commonly called secret key
algorithms, symmetric algorithms use the same key for encryption as well as decryption. Two users share a secret key
that they both use to encrypt information sent to the other as well as to decrypt information received from the other,
much as my childhood friends and I used the same number-substitution key to encode the notes we passed in class.
Because a separate key is needed for each pair of users who plan to use it, key distribution is a major problem in using
symmetric encryption. Mathematically, n users who need to communicate in pairs require n × (n – 1)/2 keys, so the
number of keys grows quadratically with the number of users (for example, 100 users need 4,950 keys). Two popular
algorithms that use symmetric keys are DES and AES (more on these shortly).
Asymmetric or public key systems don’t have the problems of key distribution and rapid key growth.
A public key can be distributed via an e-mail message or be copied to a shared directory. A message encrypted using
it can be decrypted only with the corresponding private key, which only the authorized user possesses. Since a
user (within a system) can use any other user’s public key to encrypt a message meant for that user (who has the
corresponding private key to decrypt it), the number of keys remains small: two times the number of users. The
popular encryption algorithm RSA uses public key. Public key encryption, however, is typically 10,000 times slower
than symmetric key encryption because the modular exponentiation that public key encryption uses involves
multiplication and division, which is slower than the bit operations (addition, exclusive OR, substitution, shifting)
that symmetric algorithms use. For this reason, symmetric encryption is used more commonly, while public key
encryption is reserved for specialized applications where speed is not a constraint. One place public key encryption
becomes very useful is symmetric key exchange: it allows for a protected exchange of a symmetric key, which can then
be used to secure further communications.
Symmetric and asymmetric encryptions, and DES, AES, and RSA in particular, are used as building blocks to
perform such computing tasks as signing documents, detecting a change, and exchanging sensitive data, as you’ll
learn in the “Applications of Encryption” section. For now, take a closer look at each of these popular algorithms.
Data Encryption Standard (DES)
Developed by IBM from its Lucifer algorithm, the data encryption standard (DES) was officially adopted as a US
federal standard in November 1976 for use on all public- and private-sector unclassified communication. The DES
algorithm is a complex combination of two fundamental principles of encryption: substitution and transposition. The
robustness of this algorithm is due to repeated application of these two techniques for a total of 16 cycles. The DES
algorithm is a block algorithm, meaning it works with a 64-bit data block instead of a stream of characters. It splits an
input data block in half, performs substitution on each half separately, fuses the key with one of the halves, and finally
swaps the two halves. This process is performed 16 times and is detailed in the “DES Algorithm” sidebar.
DES ALGORITHM
For the DES algorithm, the first cycle of encryption begins when the first 64 data bits are transposed by initial
permutation. First, the 64 transposed data bits are divided into left and right halves of 32 bits each. A 64-bit
key (56 bits are used as the key; the rest are parity bits) is used to transform the data bits. Next, the key gets a
left shift by a predetermined number of bits and is transposed. The resultant key is combined with the right half
(substitution) and the result is combined with the left half after a round of permutation. This becomes the new
right half. The old right half (one before combining with key and left half) becomes the new left half. This cycle
(Figure 8-1) is performed 16 times. After the last cycle is completed, a final transposition (which is the inverse of
the initial permutation) is performed to complete the algorithm.
Figure 8-1. Cycle of the DES algorithm (the 64-bit input data block is transposed and split into two halves; the 56-bit key gets a left shift and is transposed; the right half is substituted using the modified key and, after one more round of transposition, is combined with the left half to form the new right half, while the old right half becomes the new left half; the cycle continues 15 more times)
Because DES limits its arithmetic and logical operations to 64-bit numbers, it can be used with software for
encryption on most of the current 64-bit operating systems.
The real weakness of this algorithm is against an attack called differential cryptanalysis, in which a key can be
determined from chosen cipher texts in 2^58 searches. The cryptanalytic attack has not exposed any significant,
exploitable vulnerability in DES, but the risks of using the 56-bit key are increasing with easy availability of computing
power. Although the computing power or time needed to break DES is still significant, a determined hacker can
certainly decrypt text encrypted with DES. If a triple-DES approach (invoking DES three times for encryption using
the sequence: encryption via Key1, decryption using Key2, encryption using Key3) is used, the effective key length
becomes 112 (if only two of the three keys are unique) or 168 bits (if Key1, Key2, and Key3 are all unique), increasing
the difficulty of attack exponentially. DES can be used in the short term, but is certainly at end-of-life and needs to be
replaced by a more robust algorithm.
Advanced Encryption Standard (AES)
In 1997, the US National Institute of Standards and Technology called for a new encryption algorithm; subsequently,
the Advanced Encryption Standard (AES) became the new standard in 2001. Originally called Rijndael, AES is also a block
cipher and uses multiple cycles, or rounds, to encrypt data using an input data block size of 128 bits. Encryption keys of
128, 192, and 256 bits require 10, 12, or 14 cycles of encryption, respectively. The cycle of AES is simple, involving a
substitution, two permuting functions, and a keying function (see the sidebar “AES Algorithm” for more detail). There
are no known practical weaknesses of AES, and it is in wide commercial use.
AES ALGORITHM
To help you visualize the operations of AES, let me first assume input data to be 9 bytes long and represent the
AES matrix as a 3 × 3 array with the data bytes b0 through b8.
Depicted in Figure 8-2, each round of the AES algorithm consists of the following four steps:
Substitute: To diffuse the data, each byte of a 128-bit data block is substituted using a
substitution table.
Shift row: The rows of data are permuted by a left circular shift; the first (leftmost, high
order) n elements of row n are shifted around to the end (rightmost, low order). Therefore, a
row n is shifted left circular (n – 1) bytes.
Mix columns: To transform the columns, the three elements of each column are multiplied
by a polynomial. For each element the bits are shifted left and exclusive-ORed with
themselves to diffuse each element of the column over all three elements of that column.
Add round key: Last, a portion of the key unique to this cycle (subkey) is exclusive-ORed or
added to each data column. A subkey is derived from the key using a series of permutation,
substitution, and ex-OR operations on the key.
[Figure: the 128-bit input block (bytes b0 through b8 in the 3 × 3 example) is processed in four steps: each byte is substituted using a substitution table; each row is shifted with a left circular shift of (n - 1) bytes; column bits are shifted left and ex-ORed with themselves; and a subkey, generated from the 128-bit key by a series of permutation, substitution, and ex-OR operations, is ex-ORed into each data column. The cycles then continue for the remaining rounds.]
Figure 8-2. Cycle of the AES algorithm
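Java's javax.crypto API exposes AES in the same way. Here is a minimal sketch along the lines of the triple-DES example, using a 256-bit key (so 14 rounds internally); again, the mode, padding, and sample text are illustrative assumptions:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class AesSketch {
    public static void main(String[] args) throws Exception {
        // A 256-bit AES key implies 14 encryption rounds per block
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] iv = cipher.getIV();
        byte[] ct = cipher.doFinal("financial record".getBytes("UTF-8"));

        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        System.out.println(new String(cipher.doFinal(ct), "UTF-8"));
    }
}

Note that some older JREs require the unlimited-strength policy files to be installed before they will accept a 256-bit AES key.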
Rivest-Shamir-Adelman Encryption
With DES, AES, and other symmetric key algorithms, each pair of users needs a separate key. Each time a user (n + 1)
is added, n more keys are required, making it hard to track keys for each additional user with whom you need to
communicate. Determining as well as distributing these keys can be a problem—as can maintaining security for the
distributed keys because they can’t all be memorized. Asymmetric or public key encryption, however, helps you avoid
this and many other issues encountered with symmetric key encryption. The most famous algorithm that uses public
key encryption is the Rivest-Shamir-Adelman (RSA) algorithm. Introduced in 1978 and named after its three inventors
(Rivest, Shamir, and Adelman), RSA remains secure to date with no serious flaws yet found. To understand how RSA
works, see the “Rivest-Shamir-Adelman (RSA) Encryption” sidebar.
RIVEST-SHAMIR-ADELMAN (RSA) ENCRYPTION
The RSA encryption algorithm combines results from number theory with the degree of difficulty in determining the
prime factors of a given number. The RSA algorithm operates with arithmetic mod n; mod n for a number P is the
remainder when you divide P by n.
The two keys used in RSA for decryption and encryption are interchangeable; either can be chosen as the public
key and the other can be used as the private key. Any plaintext block P is encrypted as P^e mod n. Because the
exponentiation is performed mod n, and e as well as n are very large numbers (e is typically 100 digits and n
typically 200), factoring P^e to decrypt the encrypted plaintext is almost impossible. The decrypting key d
is chosen so that (P^e)^d mod n = P. Therefore, the legitimate receiver who knows d can simply compute (P^e)^d
mod n = P and thus recover P without the need to factor P^e. The encryption algorithm is based on the underlying
problem of factoring large numbers, which has no easy or fast solution.
How are keys determined for encryption? If your plain text is P and you are computing P^e mod n, then the
encryption keys will be the numbers e and n, and the decryption keys will be d and n. A product of the two prime
numbers p and q (each typically almost 100 digits long), the value of n should be very large, approximately
200 decimal digits (512 bits or more) long. If needed, n can be 768 bits or even 1024 bits. The larger the value of n,
the greater the difficulty of factoring n to determine p and q.
As a next step, a number e is chosen such that e has no factors in common with (p - 1) × (q - 1). One way of
ensuring this is to choose e as a prime number larger than (p - 1) as well as (q - 1).
Last, select a number d such that, mathematically:
e × d ≡ 1 mod ((p - 1) × (q - 1))
As you can see, even though n is known to be the product of two primes, if they are large, it is not feasible to
determine the primes p and q or the private key d from e. Therefore, this scheme provides adequate security
for d. That is also the reason RSA is secure and used commercially. It is important to note, though, that due to
improved algorithms and increased computing power RSA keys up to 1024 bits have been broken (though not
trivially by any means). Therefore, the key size considered secure enough for most applications is 2048 bits; for
more sensitive data, you should use 4096 bits.
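To make the arithmetic concrete with deliberately tiny (and utterly insecure) numbers: p = 5 and q = 11 give n = 55 and (p - 1) × (q - 1) = 40; choosing e = 3 and d = 27 satisfies e × d ≡ 1 mod 40, so P = 2 encrypts to 2^3 mod 55 = 8, and 8^27 mod 55 recovers 2. In practice you let a library generate the keys; the following minimal sketch uses Java's standard security API with a 2048-bit modulus (the provider's default padding and the sample text are illustrative assumptions):

import javax.crypto.Cipher;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class RsaSketch {
    public static void main(String[] args) throws Exception {
        // 2048-bit modulus n, per the sizing guidance above
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair pair = kpg.generateKeyPair();

        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, pair.getPublic());    // computes P^e mod n
        byte[] ct = rsa.doFinal("short secret".getBytes("UTF-8"));

        rsa.init(Cipher.DECRYPT_MODE, pair.getPrivate());   // computes (P^e)^d mod n = P
        System.out.println(new String(rsa.doFinal(ct), "UTF-8"));
    }
}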
Digital Signature Algorithm Encryption (DSA)
Another popular algorithm using public key encryption is DSA (Digital Signature Algorithm). Although the original
purpose of this algorithm was signing, it can be used for encrypting, too. DSA security rests on the mathematical
difficulty of the discrete logarithm problem and is designed on the assumption that the discrete logarithm problem
has no quick or efficient solution. Table 8-1 compares DSA with RSA.
Table 8-1. DSA vs. RSA

Attribute                      | DSA        | RSA
Key generation                 | Faster     |
Encryption                     |            | Faster
Decryption                     | Faster**   |
Digital signature generation   | Faster     |
Digital signature verification |            | Faster
Slower client                  | Preferable |
Slower server                  |            | Preferable

**Please note that "Faster" also implies less usage of computational resources
To summarize, DSA and RSA have almost the same cryptographic strengths, although each has its own
performance advantages. In case of performance issues, it might be a good idea to evaluate where the problem lies
(at the client or server) and base your choice of key algorithm on that.
Applications of Encryption
In many cases, one type of encryption is more suited for your needs than another, or you may need a combination of
encryption methods to satisfy your needs. Four common applications of encryption algorithms that you’ll encounter
are cryptographic hash functions, key exchange, digital signatures, and certificates. For HDFS, client data access uses
TCP/IP protocol, which in turn uses SASL as well as data encryption keys. Hadoop web consoles and MapReduce
shuffle use secure HTTP that uses public key certificates. Intel’s Hadoop distribution (now Project Rhino) uses
symmetric keys for encryption at rest and certificates for encrypted data processing through MapReduce jobs. To
better appreciate how Hadoop and others use these applications, you need to understand how each works.
Hash Functions
In some situations, integrity is a bigger concern than secrecy. For example, in a document management system that
stores legal documents or manages loans, knowing that a document has not been altered is important. So, encryption
can be used to provide integrity as well.
In most files, components of the content are not bound together in any way. In other words, each character is
independent in a file, and even though changing one value affects the integrity of the file, it can easily go undetected.
Encryption can be used to “seal” a file so that any change can be easily detected. One way of providing this seal is
to compute a cryptographic function, called a hash or checksum, or a message digest, of the file. Because the hash
function depends on all bits of the file being sealed, altering one bit will alter the checksum result. Each time the file is
accessed or used, the hash function recomputes the checksum, and as long as the computed checksum matches the
stored value, you know the file has not been changed.
DES and AES work well for sealing values, because a key is needed to modify the stored value (to match modified
data). Block ciphers also use a technique called chaining: a block is linked to the previous block’s value and hence to
all previous blocks in a file like a chain by using an exclusive OR to combine the encrypted previous block with the
encryption of the current one. Subsequently, a file’s cryptographic checksum could be the last block of the chained
encryption of a file because that block depends on all other blocks. Popular hash functions are MD4, MD5
(MD meaning Message Digest), and SHA/SHS (Secure Hash Algorithm or Standard). In fact, Hadoop uses the SASL
MD5-DIGEST mechanism for authentication when a Hadoop client with Hadoop token credentials connects to a
Hadoop daemon (e.g., a MapReduce task reading/writing to HDFS).
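As a simple illustration of sealing a file, the following sketch computes a SHA-256 checksum using Java's standard MessageDigest API; you would store the hex string and recompute it on each access to detect tampering (the file path comes from the command line and is, of course, just an example):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class FileSeal {
    public static void main(String[] args) throws Exception {
        byte[] content = Files.readAllBytes(Paths.get(args[0]));

        // The digest depends on every bit of the file; altering one bit changes it
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] checksum = md.digest(content);

        StringBuilder hex = new StringBuilder();
        for (byte b : checksum) hex.append(String.format("%02x", b));
        System.out.println(hex);   // store this value; compare on every later access
    }
}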
Key Exchange
Suppose you need to exchange information with an unknown person (who does not know you either), while making
sure that no one else has access to the information. The solution is public key cryptography. Because asymmetric keys
come in pairs, one half of the pair can be exposed without compromising the other half. A private key can be used to
encrypt, and the recipient just needs to have access to the public key to decrypt it. To understand the significance of
this, consider an example key exchange.
Suppose Sam and Roy need to exchange a shared symmetric key, and both have public keys for a common
encryption algorithm (call these KPUB-S and KPUB-R) as well as private keys (call these KPRIV-S and KPRIV-R). The simplest
solution is for Sam to choose any symmetric key K, and encrypt it using his private key (KPRIV-S) and send to Roy, who
can use Sam’s public key to remove the encryption and obtain K. Unfortunately, anyone with access to Sam’s public
key can also obtain the symmetric key K that is only meant for Roy. So, a more secure solution is for Sam to first
encrypt the symmetric key K using his own private key and then encrypt it again using Roy’s public key. Then, Roy
can use his private key to decrypt the first level of encryption (outer encryption)—something only he can do—and
then use Sam’s public key to decrypt the “inner encryption” (proving that communication came from Sam). So, in
conclusion, the symmetric key can be exchanged without compromising security.
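The following minimal sketch shows the confidentiality half of such an exchange using Java's standard API: Sam wraps the symmetric key K under Roy's public key, and only Roy's private key can unwrap it. (The authenticity layer, in which Sam first encrypts with his own private key, would be layered on top; the class and variable names here are my own.)

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.security.Key;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class KeyExchangeSketch {
    public static void main(String[] args) throws Exception {
        // Roy's key pair; Sam needs only the public half (KPUB-R)
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair roy = kpg.generateKeyPair();

        // Sam chooses the symmetric key K ...
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey k = kg.generateKey();

        // ... and wraps it with Roy's public key
        Cipher wrapper = Cipher.getInstance("RSA");
        wrapper.init(Cipher.WRAP_MODE, roy.getPublic());
        byte[] wrapped = wrapper.wrap(k);

        // Only Roy's private key (KPRIV-R) can unwrap K
        wrapper.init(Cipher.UNWRAP_MODE, roy.getPrivate());
        Key recovered = wrapper.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
        System.out.println(recovered.getAlgorithm());   // prints: AES
    }
}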
Digital Signatures and Certificates
Today, most of our daily transactions are conducted in the digital world, so the concept of a signature for approval
has evolved to a model that relies on mutual authentication of digital signatures. A digital signature is a protocol that
works like a real signature: it can provide a unique mark for a sender, and enable others to identify a sender from that
mark and thereby confirm an agreement. Digital signatures need the following properties:
Unreproducible
Uniquely traceable to its source (authentic: from the expected source only)
Inseparable from the message
Immutable after being transmitted
Recent and one-time use, with duplicate usage disallowed
Public key encryption systems are ideally suited to digital signatures. For example, a publishing company can
first encrypt a contract using their own private key and then encrypt it again using the author’s public key. The author
can use his private key to decrypt the first level of encryption, and then use publisher’s public key to decrypt the
inner encryption to get to the contract. After that, the author can “sign” it by creating a hash value of the contract and
then encrypting the contract and the hash with his own private key. Finally, he can add one more layer of encryption
by encrypting again using the publisher’s public key and then e-mail the encrypted contract back to the publisher.
Because only the author and publisher have access to their private keys, the exchange clearly is unforgeable and
uniquely authentic. The hash function and checksum confirm immutability (assuming an initial checksum of the
contract was computed and saved for comparison), while the frequency and timestamps of the e-mails ensure
one-time recent usage. Figure 8-3 summarizes the process.
[Figure: (1) the publishing company encrypts contract C using its private key KPRIV-P, producing E(C, KPRIV-P); (2) a second level of encryption is applied using the author's public key KPUB-A, producing E(E(C, KPRIV-P), KPUB-A); (3) the contract is sent to the author; (4) the author uses his private key KPRIV-A to decrypt the first level of encryption; (5) the author uses the publisher's public key KPUB-P to decrypt the second level; (6) the author creates a hash value of the contract, encrypts hash and contract with his private key KPRIV-A and last with the publisher's public key KPUB-P, producing E(E(C+H, KPRIV-A), KPUB-P); (7) the contract is e-mailed back to the publisher.]
Figure 8-3. Using Digital signatures for encrypted communication
In Figure 8-3, E(C,KPRIV-P) means contract C was encrypted using KPRIV-P. Similarly, D(E(E(C,KPRIV-P), KPUB-A), KPRIV-A)
means the first level of the doubly encrypted contract sent to the author was decrypted using KPRIV-A.
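As a minimal illustration of the signing step itself (hashing the content and encrypting the hash with the signer's private key), here is a sketch using Java's standard Signature API; the key pair and contract text are illustrative assumptions:

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class SignatureSketch {
    public static void main(String[] args) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair author = kpg.generateKeyPair();

        byte[] contract = "contract terms".getBytes("UTF-8");

        // Sign: SHA-256 hash of the contract, encrypted with the private key
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(author.getPrivate());
        signer.update(contract);
        byte[] signature = signer.sign();

        // Anyone holding the public key can verify the mark
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(author.getPublic());
        verifier.update(contract);
        System.out.println(verifier.verify(signature));  // prints: true
    }
}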
Founded on trust between parties through a common respected individual, a digital certificate serves for multiple
parties a role similar to the one a digital signature serves for two individuals. A certificate associates a public key
with a user's identity, and a certificate authority then "signs" the certificate, certifying the accuracy of that
association and authenticating the identity.
For example, a publishing company might set up a certificate scheme to authenticate authors, their agents,
and company editors in the following way. First, the publisher selects a public key pair, posts the public key where
everyone in the company has access to it, and retains the private key. Then, each editor creates a public key pair, puts
the public key in a message together with his or her identity, and passes the message securely to the publisher. The
publisher signs it by creating a hash value of the message and then encrypting the message and the hash with his or
her private key. By signing the message, the publisher affirms that the public key (the editor’s) and the identity (also
the editor’s) in the message are for the same person. This message is called the editor’s certificate. The author can
create a message with his public key, and the author’s agent can sign, hash, and return it. That will be the author’s
certificate. So, the author and editor’s certificates can thus be set up and used for verifying their identities. Anyone
can verify the editor’s certificate by starting with the publisher’s public key and decrypting the editor’s certificate to
retrieve his or her public key and identity. The author’s certificate can be verified by starting with the public key
the agent obtained from the publisher and using that to decrypt the certificate to retrieve the author’s public key
and identity.
Because Hadoop uses different types of encryption for its various components, I will briefly discuss where each of
these encryptions is used in the next section.
Hadoop Encryption Options Overview
When considering encryption of sensitive data in Hadoop, you need to consider data “at rest” stored on disks within
your cluster nodes, and also data in transit, which is moved during communication among the various nodes and
also between nodes and clients. Chapter 4 explained the details of securing data in transit between nodes and
applications; you can configure individual Hadoop ecosystem components for encryption (using the component’s
configuration file) just as you would configure Hadoop’s RPC communication for encryption. For example, to
configure SSL encryption for Hive, you would need to change configuration within hive-site.xml (the property
hive.server2.use.SSL in hive-site.xml needs to be set to true and the KeyStore needs to be specified using
properties hive.server2.keystore.path and hive.server2.keystore.password). This chapter, therefore, focuses
on configuring Hadoop data at rest.
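Putting those three properties together, the hive-site.xml entries would look something like this sketch (the KeyStore path and password are placeholders you would replace with your own values):

<property>
  <name>hive.server2.use.SSL</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.keystore.path</name>
  <value>/path/to/hiveserver2.jks</value>   <!-- placeholder path -->
</property>
<property>
  <name>hive.server2.keystore.password</name>
  <value>your-keystore-password</value>     <!-- placeholder password -->
</property>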
■ Note Encryption is a CPU-intensive activity that can tax your hardware and slow its processing. Weigh the decision to
use encryption carefully. If you determine encryption is necessary, implement it for all the data stored within your cluster
as well as for processing related to that data.
For a Hadoop cluster, data at rest is the data distributed on all the DataNodes. The need for encryption may arise
because the data is sensitive and the information needs to be protected, or because encryption is necessary for
compliance with legal regulations like the healthcare industry's HIPAA or the financial industry's SOX.
Although no Hadoop distribution currently provides encryption at rest, such major vendors as Cloudera and
Hortonworks offer third-party solutions. For example, Cloudera works with zNcrypt from Gazzang to provide
encryption at rest for data blocks as well as files. For additional protection, zNcrypt uses process-based ACLs and
keys. In addition, Amazon Web Services (AWS) offers encryption at rest with its Elastic MapReduce web service and S3
storage (you’ll learn more about this shortly), and Intel’s distribution of Hadoop also offers encryption at rest. But all
these solutions are either proprietary or limit you to a particular distribution of Hadoop.
For an open source solution to encrypt Hadoop data at rest, you can use Project Rhino. In 2013, Intel started
an open source project to improve the security capabilities of Hadoop and the Hadoop ecosystem by contributing
code to Apache. This code is not yet implemented in Apache Foundation’s Hadoop distribution, but it contains
enhancements that include distributed key management and the capability to do encryption at rest. The overall goals
for this open source project are as follows:
Support for encryption and key management
A common authorization framework (beyond ACLs)
A common token-based authentication framework
Security improvements to HBase
Improved security auditing
You can check the progress of Project Rhino at https://github.com/intel-hadoop/project-rhino, and learn
more about it in the next section.
Encryption Using Intel’s Hadoop Distro
In 2013, Intel announced its own Hadoop distribution—a strange decision for a hardware manufacturing company,
entering the Big Data arena belatedly with a Hadoop distribution. Intel, however, assured the Hadoop world that its
intentions were only to contribute to the Hadoop ecosystem (Apache Foundation) and help out with Hadoop security
concerns. Intel claimed its Hadoop distribution worked in perfect harmony with specific Intel chips (used as the CPU)
to perform encryption and decryption about 10 to 15 times faster than current alternatives.
Around the same time, I had a chance to work with an Intel team on a pilot project for a client who needed
data stored within HDFS to be encrypted, and I got to know how Intel’s encryption worked. The client used Hive
for queries and reports and Intel offered encryption that covered HDFS as well as Hive. Although the distribution I
used (which forms the basis of the information presented in this section), is not available commercially, most of the
functionality it offered will be available through Project Rhino and Cloudera’s Hadoop distribution (now that Intel has
invested in it).
Specifically, the Intel distribution used codecs to implement encryption (more on these in a moment) and
offered file-level encryption that could be used with Hive or HBase. It used symmetric as well as asymmetric keys in
conjunction with Java KeyStores (see the sidebar “KeyStores and TrustStores” for more information). The details of
the implementation I used will give you some insight into the potential of Project Rhino.
KEYSTORES AND TRUSTSTORES
A KeyStore is a database or repository of keys or trusted certificates that are used for a variety of purposes,
including authentication, encryption, and data integrity. A key entry contains the owner’s identity and private
key, whereas a trusted certificate entry contains only a public key in addition to the entity’s identity. For better
management and security, you can use two KeyStores: one containing your key entries and the other containing
your trusted certificate entries (including Certificate Authorities’ certificates). Access can be restricted to the
KeyStore with your private keys, while trusted certificates reside in a more publicly accessible TrustStore.
Used when making decisions about what to trust, a TrustStore contains certificates from someone you expect
to communicate with or from Certificate Authorities that you trust to identify other parties. Add an entry to a
TrustStore only if you trust the entity from which the potential entry originated.
Various types of KeyStores are available, such as PKCS12 or JKS. JKS is most commonly used in the Java world.
PKCS12 isn’t Java specific but is convenient to use with certificates that have private keys backed up from a
browser or the ones coming from OpenSSL-based tools. PKCS12 is mainly useful as a KeyStore but less so for
a TrustStore, because it needs to have a private key associated with certificates. JKS doesn’t require each entry
to be a private key entry, so you can use it as a TrustStore for certificates you trust but for which you don’t need
private keys.
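To give you an idea of how an application reads such a KeyStore programmatically, here is a minimal sketch using Java's standard KeyStore API; the file name, alias, and passwords are placeholders for illustration:

import java.io.FileInputStream;
import java.security.Key;
import java.security.KeyStore;

public class KeyStoreSketch {
    public static void main(String[] args) throws Exception {
        // JCEKS KeyStores can hold secret (symmetric) keys as well as key pairs
        KeyStore ks = KeyStore.getInstance("JCEKS");
        try (FileInputStream in = new FileInputStream("my.keystore")) {
            ks.load(in, "store-password".toCharArray());   // KeyStore password
        }
        // Each entry is protected by its own key password
        Key key = ks.getKey("myKeyAlias", "key-password".toCharArray());
        System.out.println(key.getAlgorithm());            // e.g., AES
    }
}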
Step-by-Step Implementation
The client’s requirement was encryption at rest for sensitive financial data stored within HDFS and accessed using
Hive. So, I had to make sure that the data file, which was pushed from SQL Server as a text file, was encrypted while it
was stored within HDFS and also was accessible normally (with decryption applied) through Hive, to authorized users
only. Figure 8-4 provides an overview of the encryption process.
[Figure: (1) create a symmetric key and KeyStore; (2) create a key pair (private/public key) and KeyStore; (3) create a TrustStore (contains public certificates); (4) extract certificates from the KeyStore defined in step 2 and import them into the TrustStore; (5) the Hadoop component Pig uses the symmetric key to encrypt the unencrypted HDFS file; (6) the Hadoop component Hive defines an encrypted external table over it (using the symmetric key created in step 1); (7) authorized clients access the encrypted data through MapReduce jobs using certificates from the TrustStore.]
Figure 8-4. Data encryption at Intel Hadoop distribution
The first step to achieve my goal was to create a secret (symmetric) key and KeyStore with the command
(I created a directory /keys under my home directory and created all encryption-related files there):
> keytool -genseckey -alias BCLKey -keypass bcl2601 -storepass bcl2601 -keyalg AES -keysize 256 -keystore BCLKeyStore.keystore -storetype JCEKS
This keytool command generates the secret key BCLKey and stores it in a newly created KeyStore called
BCLKeyStore. The keyalg parameter specifies the algorithm AES to be used to generate the secret key, and keysize
256 specifies the size of the key to be generated. Last, keypass is the password used to protect the secret key, and
storepass does the same for the KeyStore. You can adjust permissions for the KeyStore with:
> chmod 600 BCLKeyStore.keystore
Next, I created a key pair (private/public key) and KeyStore with the command:
> keytool -genkey -alias KEYCLUSTERPRIVATEASYM -keyalg RSA -keystore clusterprivate.keystore -storepass 123456 -keypass 123456 -dname "CN=JohnDoe, OU=Development, O=Intel, L=Chicago, S=IL, C=US" -storetype JKS -keysize 1024
This generates a key pair (a public key and associated private key) and single-element certificate chain stored as
entry KEYCLUSTERPRIVATEASYM in the KeyStore clusterprivate.keystore. Notice the use of algorithm RSA for public
key encryption and the key length of 1024. The parameter dname specifies the name to be associated with alias, and is
used as the issuer and subject in the self-signed certificate.
I distributed the created KeyStore clusterprivate.keystore across the cluster using Intel Manager:
(admin) ➤ configuration ➤ security ➤ Key Management.
To create a TrustStore, I next took the following steps:
1. Extract the certificate from the newly created KeyStore with the command:
keytool -export -alias KEYCLUSTERPRIVATEASYM -keystore clusterprivate.keystore -rfc -file hivepublic.cert -storepass 123456
From the KeyStore clusterprivate.keystore, the command reads the certificate associated with alias
KEYCLUSTERPRIVATEASYM and stores it in the file hivepublic.cert. The certificate is output in the printable
encoding format (as the -rfc option indicates).
2. Create a TrustStore containing the public certificate:
keytool -import -alias HIVEKEYCLUSTERPUBLICASYM -file hivepublic.cert -keystore clusterpublic.TrustStore -storepass 123456
This command reads the certificate (or certificate chain) from the file hivepublic.cert and stores it in the
KeyStore (used as a TrustStore) entry identified by HIVEKEYCLUSTERPUBLICASYM. The TrustStore
clusterpublic.TrustStore is created, and the imported certificate is added to the list of trusted certificates.
3. Change clusterpublic.TrustStore ownership to root, group to hadoop, and permissions to "644"
(read/write for root and read for members of all groups) with the commands:
chmod 644 clusterpublic.TrustStore
chown root:hadoop clusterpublic.TrustStore
4. Create a file TrustStore.passwords, set its permission to "644", and add the following contents to the
file: keystore.password=123456.
5. Copy the /keys directory and all of its files to all the other nodes in the cluster. On each node, the
KeyStore directory must be in /usr/lib/hadoop/.
With the TrustStore ready, I subsequently created a text file (bcl.txt) to use for testing encryption and copied it
to HDFS:
hadoop fs -mkdir /tmp/bcl
hadoop fs -put bcl.txt /tmp/bcl
I started Pig (> pig) and was taken to the grunt> prompt. I executed the following commands within Pig to
set all the required environment variables:
set KEY_PROVIDER_PARAMETERS 'keyStoreUrl=file:////root/bcl/BCLKeyStore.keystore&keyStoreType=JCEKS&password=bcl2601';
set AGENT_SECRETS_PROTECTOR 'com.intel.hadoop.mapreduce.crypto.KeyStoreKeyProvider';
set AGENT_PUBLIC_KEY_PROVIDER 'org.apache.hadoop.io.crypto.KeyStoreKeyProvider';
set AGENT_PUBLIC_KEY_PROVIDER_PARAMETERS 'keyStoreUrl=file:////keys/clusterpublic.TrustStore&keyStoreType=JKS&password=123456';
set AGENT_PUBLIC_KEY_NAME 'HIVEKEYCLUSTERPUBLICASYM';
set pig.encrypt.keyProviderParameters 'keyStoreUrl=file:////root/bcl/BCLKeyStore.keystore&keyStoreType=JCEKS&password=bcl2601';
Next, to read the bcl.txt file from HDFS, encrypt it, and store it into the same location in a directory named
bcl_encrypted, I issued the commands:
raw = LOAD '/tmp/bcl/bcl.txt' AS (name:chararray, age:int, country:chararray);
STORE raw INTO '/tmp/bcl/bcl_encrypted' USING PigStorage('\t', '-keyName BCLKey');
After exiting Pig, I checked the contents of the encrypted file by issuing the command:
hadoop fs -cat /tmp/bcl/bcl_encrypted/part-m-00000.aes
Seeing control characters instead of readable text confirmed the encryption. I then created a Hive external table and
pointed it to the encrypted file using the following steps:
Start Hive.
Set the environment variables:
set hive.encrypt.master.keyName=BCLKey;
set hive.encrypt.master.keyProviderParameters=keyStoreUrl=file:////root/bcl/BCLKeyStore.keystore&keyStoreType=JCEKS&password=bcl2601;
set hive.encrypt.keyProviderParameters=keyStoreUrl=file:////root/bcl/BCLKeyStore.keystore&keyStoreType=JCEKS&password=bcl2601;
set mapred.crypto.secrets.protector.class=com.intel.hadoop.mapreduce.cryptocontext.provider.AgentSecretsProtector;
set mapred.agent.encryption.key.provider=org.apache.hadoop.io.crypto.KeyStoreKeyProvider;
set mapred.agent.encryption.key.provider.parameters=keyStoreUrl=file:////keys/clusterpublic.TrustStore&keyStoreType=JKS&password=123456;
set mapred.agent.encryption.keyname=HIVEKEYCLUSTERPUBLICASYM;
Create an encrypted external table pointing to the encrypted data file created by Pig:
create external table bcl_encrypted_pig_data(name STRING, age INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/tmp/bcl/bcl_encrypted/' TBLPROPERTIES("hive.encrypt.enable"="true", "hive.encrypt.keyName"="BCLKey");
Once the table is created, decrypted data can be viewed by any authorized client
(having appropriate key and certificate files within /usr/lib/hadoop/keys directory) using
the select query (in Hive syntax) at the Hive prompt:
select * from bcl_encrypted_pig_data;
To summarize, to implement the Intel distribution for use with Hive, I set up the keys, KeyStores, and
certificates that were used for encryption. Then I extracted the certificate from the KeyStore and imported it into a
TrustStore. Note that although I created the key pair and certificate for a user JohnDoe in the example, for a multiuser
environment you will need to create a key pair and certificates for all authorized users.
A symmetric key was used to encrypt data within HDFS (and with Hive). MapReduce used a public key and
certificate, because client communication within Hive uses MapReduce. That’s also the reason a key pair and
certificate will be necessary for authorized users for Hive (who are authorized to access the encrypted data).
Special Classes Used by Intel Distro
The desired functionality of encryption at rest needs special codecs, classes, and logic implemented. Although many
classes and codecs were available, they didn’t work in harmony backed by a common logic to provide the encryption
functionality. Intel has added the underlying logic in its distribution.
For example, org.apache.hadoop.io.crypto.KeyStoreKeyProvider is an implementation of the class org.
apache.hadoop.io.crypto.KeyProvider. The corresponding Apache class for HBase is org.apache.hadoop.
hbase.io.crypto.KeyStoreKeyProvider, which is an implementation of org.apache.hadoop.hbase.io.crypto.
KeyProvider. This class is used to resolve keys from a protected KeyStore file on the local file system. Intel has used
this class to manage keys stored in KeyStore (and TrustStore) files. The other HBase classes used are:
org.apache.hadoop.hbase.io.crypto.Cipher
org.apache.hadoop.hbase.io.crypto.Decryptor
org.apache.hadoop.hbase.io.crypto.Encryptor
How are these classes used? For example, in Java terms, the method Encryption.decryptWithSubjectKey for
class org.apache.hadoop.hbase.io.crypto.Cipher decrypts a block of encrypted data using the symmetric key
provided; whereas the method Encryption.encryptWithSubjectKey encrypts a block of data using the provided
symmetric key. So, to summarize, this class provides encryption/decryption using the symmetric key.
The Intel custom class com.intel.hadoop.mapreduce.crypto.KeyStoreKeyProvider was designed for encrypted
MapReduce processing and works similarly to the Apache Hadoop crypto class mapred.crypto.KeyStoreKeyProvider.
It is adapted for use with MapReduce jobs and is capable of processing certificates as well as keys.
Most of these classes are developed and used by the Apache Foundation. The only difference is that the Apache
Foundation’s Hadoop distribution doesn’t use these classes to provide cumulative functionality of encryption at
rest, nor do any of the other distributions available commercially. Project Rhino is trying to remedy that situation,
and since even the Intel custom classes and codecs are available for their use, you can expect the encryption-at-rest
functionality to be available through Project Rhino very soon.
Using Amazon Web Services to Encrypt Your Data
As you have seen, installing and using encryption can be a tough task, but Amazon has consciously endeavored
to make it simple. AWS offers easy options that eliminate most of the work and time needed to install, configure,
manage, and use encryption with Hadoop. With AWS, you have the option of doing none, some, or all of the work,
depending on the configured service you rent. For example, if you need to focus on other parts of your project (such as
design of ETL for bulk load of data from RDBMS (relational database management system) to HDFS or Analytics), you
can have AWS take care of fully implementing encryption at rest for your data.
Deciding on a Model for Data Encryption and Storage
AWS provides several configurations or models for encryption usage. The first model, model A, lets you control the
encryption method as well as KMI (key management infrastructure). It offers you the utmost flexibility and control,
but you do all the work. Model B lets you control the encryption method while AWS stores the keys; you still get to
manage your keys. The most rigid choice, model C, gives you no control over KMI or encryption method, although it is
the easiest to implement because AWS does it all for you. To implement model C, you need to use an AWS service that
supports server-side encryption, such as Amazon S3, Amazon EMR, Amazon Redshift, or Amazon Glacier.
To demonstrate, I will implement encryption at rest using Amazon’s model C. Why C? The basic steps are easy to
understand, and you can use the understanding you gain to implement model A, for which you need to implement all
the tasks (I have provided steps for implementing model A as a download on the Apress web site). I will use Amazon
EMR (or Elastic MapReduce, which provides an easy-to-use Hadoop implementation running on Amazon Elastic
Compute Cloud, or EC2) along with Amazon S3 for storage. Please note: One caveat of renting the EMR service is that
AWS charges by the “normalized” hour, not actual hours, because the plan uses multiple AWS “appliances” and at
least two EC2 instances.
If you are unfamiliar with AWS’s offerings, EC2 is the focal point of AWS. EC2 allows you to rent a virtual server (or
virtual machine) that is a preconfigured Amazon Machine Image with desired operating system and choice of virtual
hardware resources (CPU, RAM, disk storage, etc.). You can boot (or start) this virtual machine or instance and run
your own applications as desired. The term elastic refers to the flexible, pay-by-hour model for any instances that you
create and use. Figure 8-5 displays the AWS management console. This is where you need to start for "renting" various
AWS components (assuming you have created an AWS account first): http://aws.amazon.com.
Figure 8-5. AWS Management console
Getting back to the implementation using model C, if you specify server-side encryption while procuring the
EMR cluster (choose the Elastic MapReduce option in the AWS console as shown in Figure 8-5), the EMR model
provides server-side encryption of your data and manages the encryption method as well as keys transparently for
you. Figure 8-6 depicts the “Envelope encryption” method AWS uses for server-side encryption. The basic steps are
as follows:
The AWS service generates a data key when you request that your data be encrypted.
AWS uses the data key for encrypting your data.
The encrypted data key and the encrypted data are stored using S3 storage.
AWS uses the key-encrypting key (unique to S3 in this case) to encrypt the data key and stores it
separately from the encrypted data.
[Figure: a MapReduce job runs on the AWS EMR cluster and generates output data that needs to be encrypted; the key generator produces a data key, which is used to encrypt the data; the encrypted data and the encrypted data key are stored in S3 storage, while the key-encrypting key is stored separately.]
Figure 8-6. Envelope encryption by AWS
For data retrieval and decryption, this process is reversed. First, the encrypted data key is decrypted using the
key-encrypting key, and then it is used to decrypt your data.
As you can see from Figure 8-6, the S3 storage service supports server-side encryption. Amazon S3 server-side
encryption uses 256-bit AES symmetric keys for data keys as well as master (key-encrypting) keys.
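To make the envelope pattern concrete, here is a minimal Java sketch of the same idea, with both keys generated locally for illustration (in the AWS case, of course, S3 holds the key-encrypting key and performs these steps for you):

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class EnvelopeSketch {
    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey masterKey = kg.generateKey();   // key-encrypting key (kept separate)
        SecretKey dataKey = kg.generateKey();     // per-object data key

        // Encrypt the data with the data key
        Cipher data = Cipher.getInstance("AES/CBC/PKCS5Padding");
        data.init(Cipher.ENCRYPT_MODE, dataKey);
        byte[] iv = data.getIV();
        byte[] encData = data.doFinal("object contents".getBytes("UTF-8"));

        // Wrap the data key with the master key; store it apart from the data
        Cipher wrap = Cipher.getInstance("AESWrap");
        wrap.init(Cipher.WRAP_MODE, masterKey);
        byte[] encDataKey = wrap.wrap(dataKey);

        // Retrieval reverses the process: unwrap the data key, then decrypt
        wrap.init(Cipher.UNWRAP_MODE, masterKey);
        SecretKey recovered = (SecretKey) wrap.unwrap(encDataKey, "AES", Cipher.SECRET_KEY);
        data.init(Cipher.DECRYPT_MODE, recovered, new IvParameterSpec(iv));
        System.out.println(new String(data.doFinal(encData), "UTF-8"));
    }
}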
Encrypting a Data File Using Selected Model
In this section, I will discuss step-by-step implementation for the EMR-based model C, in which AWS manages your
encryption method and keys transparently. As mentioned earlier, you can find the steps to implement model A on the
Apress web site.
Create S3 Storage Through AWS
You need to create the storage first, because you will need it for your EMR cluster. Simply log in to the AWS
management console, select service S3, and create a bucket named htestbucket and a folder test within (Figure 8-7).
Figure 8-7. Create an S3 bucket and folder
Specify server-side encryption for folder test that you created (Figure 8-8).
Figure 8-8. Activate server-side encryption for a folder
Adjust the permissions for the bucket htestbucket created earlier, as necessary (Figure 8-9).
Figure 8-9. Adjust permissions for an S3 bucket
Create a Key Pair (bclkey) to Be Used for Authentication
Save the .pem file to your client. Use PuTTYgen to create a .ppk (private key file) that can be used for authentication
with PuTTY to connect to the EMR cluster (Master node). For details on using PuTTY and PuTTYgen, please see
Chapter 4 and Appendix B. Figure 8-10 shows the AWS screen for key pair creation. To reach it, choose service EC2 on
the AWS management console, and then the option Key Pairs.
Figure 8-10. Creating a key pair within AWS
Create an Access Key ID and a Secret Access Key
These keys are used as credentials for encryption and are associated with a user. If you don’t have any users created
and are using the root account for AWS, then you need to create these keys for root. From the Identity and Access
Management (IAM) Management console (Figure 8-11), select Dashboard, and then click the first option, Delete your
root access keys. (If you don't have these keys created for root, you won't see this warning.) To reach the IAM console,
choose the service "Identity & Access Management" on the AWS management console.
Figure 8-11. IAM console for AWS
Click the Manage Security Credentials button, and dismiss the warning by choosing "Continue to Security Credentials"
(Figure 8-12).
Figure 8-12. Creation of security credentials
Your AWS root account is like a UNIX root account, and AWS doesn’t recommend using that. Instead, create
user accounts with roles, permissions, and access keys as needed. If you do so, you can more easily customize
permissions without compromising security. Another thing to remember about using the root account is that
you can’t retrieve the access key ID or secret access key if you lose it! So, I created a user Bhushan for use with my
EMR cluster (Figure 8-13). I used the “Users” option and “Create New Users” button from the Identity and Access
Management (IAM) Management console (Figure 8-11) to create this new user.
Figure 8-13. Creation of a user for use with EMR cluster
To set your keys for a user, again begin on the IAM Management Console, and select the Users option, then a
specific user (or create a user). Next, open the Security Credentials area and create an access key ID and a secret
access key for the selected user (Figure 8-13).
■ Note When you create the access key ID and secret access key, you can download them and save them somewhere
safe as a backup. Taking this precaution is certainly easier than creating a fresh set of keys if you lose them.
Create the AWS EMR Cluster Specifying Server-Side Encryption
With the preparatory steps finished, you’re ready to create an EMR cluster. Log on to the AWS management console,
select the Elastic MapReduce service, and click the Create Cluster button. Select the “Server-side encryption” and
“Consistent view” configuration options and leave the others at their defaults (Figure 8-14).
Figure 8-14. Creation of EMR cluster
In the Hardware Configuration section (Figure 8-15), request one Master EC2 instance to run JobTracker and
NameNode and one Core EC2 instance to run TaskTrackers and DataNodes. (This is just for testing; in the real world,
you would need to procure multiple Master or Core instances depending on the processing power you require.) In
the Security Access section, specify one of the key pairs created earlier (bclkey), while in the IAM Roles section, set
EMR_DefaultRole and EMR_EC2_DefaultRole for the EMR roles. Make sure that these roles have permissions to access
the S3 storage (bucket and folders) and any other resources you need to use.
Figure 8-15. Hardware configuration for EMR cluster
After you check all the requested configuration options, click on the “Create Cluster” button at the bottom of the
screen to create an EMR cluster as per your requirements.
In a couple of minutes, you will receive a confirmation of cluster creation similar to Figure 8-16.
Figure 8-16. EMR cluster created
Test Encryption
As a final step, test if the “at rest” encryption between EMR and S3 is functional. As per the AWS and EMR
documentation, any MapReduce jobs transferring data from HDFS to S3 storage (or S3 to HDFS) should encrypt the
data written to persistent storage.
You can verify this using the Amazon utility S3DistCp, which is designed to move large amounts of data between
Amazon S3 and HDFS (from the EMR cluster). S3DistCp supports the ability to request Amazon S3 to use server-side
encryption when it writes EMR data to an Amazon S3 bucket you manage. Before you use it, however, you need to add
the following configuration to your core-site.xml (I have blanked out my access keys):
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>xxxxxxxxxxxxxxxxxxxx</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>yyyyyyyyyyyyyyyyyyyy</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>xxxxxxxxxxxxxxxxxxxx</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>yyyyyyyyyyyyyyyyyyyy</value>
</property>
Remember to substitute values for your own access key ID and secret access key. There is no need to restart any
Hadoop daemons.
Next, make sure that the following jars exist in your /home/hadoop/lib (/lib under my Hadoop install directory).
If not, find and copy them there:
/home/hadoop/lib/emr-s3distcp-1.0.jar
/home/hadoop/lib/gson-2.1.jar
/home/hadoop/lib/emr-s3distcp-1.0.jar
/home/hadoop/lib/EmrMetrics-1.0.jar
/home/hadoop/lib/httpcore-4.1.jar
/home/hadoop/lib/httpclient-4.1.1.jar
Now, you’re ready to run the S3DistCp utility and copy a file test1 from HDFS to folder test for S3 bucket
htestbucket:
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -libjars /home/hadoop/lib/gson-2.1.jar,/home/hadoop/lib/emr-s3distcp-1.0.jar,/home/hadoop/lib/EmrMetrics-1.0.jar,/home/hadoop/lib/httpcore-4.1.jar,/home/hadoop/lib/httpclient-4.1.1.jar --src /tmp/test1 --dest s3://htestbucket/test/ --disableMultipartUpload --s3ServerSideEncryption
My example produced the following response in a few seconds:
14/10/10 03:27:47 INFO s3distcp.S3DistCp: Running with args: -libjars /home/hadoop/lib/gson-2.1.jar,/home/hadoop/lib/emr-s3distcp-1.0.jar,/home/hadoop/lib/EmrMetrics-1.0.jar,/home/hadoop/lib/httpcore-4.1.jar,/home/hadoop/lib/httpclient-4.1.1.jar --src /tmp/test1 --dest s3://htestbucket/test/ --disableMultipartUpload --s3ServerSideEncryption
....
....
14/10/10 03:27:51 INFO client.RMProxy: Connecting to ResourceManager at
14/10/10 03:27:54 INFO mapreduce.Job: The url to track the job: http://10.232.45.82:9046/proxy/application_1412889867251_0001/
14/10/10 03:27:54 INFO mapreduce.Job: Running job: job_1412889867251_0001
14/10/10 03:28:12 INFO mapreduce.Job: map 0% reduce 0%
....
....
14/10/10 03:30:17 INFO mapreduce.Job: map 100% reduce 100%
14/10/10 03:30:18 INFO mapreduce.Job: Job job_1412889867251_0001 completed successfully
Clearly, the MapReduce job copied the file successfully to S3 storage. Now, you need to verify if the file is stored
encrypted within S3. To do so, use the S3 management console and check properties of file test1 within folder test in
bucket htestbucket (Figure 8-17).
Figure 8-17. Verifying server-side encryption for MapReduce job
As you can see, the property Server Side Encryption is set to AES-256, meaning the MapReduce job from the EMR
cluster successfully copied data to S3 storage with server-side encryption!
You can try other ways of invoking MapReduce jobs (e.g., Hive queries or Pig scripts) and write to S3 storage
to verify that the stored data is indeed encrypted. You can also use S3DistCp to transfer data from your own local
Hadoop cluster to Amazon S3 storage. Just make sure that you copy the AWS credentials in core-site.xml on all
nodes within your local cluster and that the previously listed six .jar files are in the /lib subdirectory of your Hadoop
install directory.
If you’d like to compare this implementation of encryption using AWS EMR with implementation of the more
hands-on model A (in which you manage encryption and keys, plus you need to install specific software on EC2
instances for implementing encryption), remember you can download and review those steps from the Apress
web site.
You’ve now seen both alternatives for providing encryption at rest with Hadoop (using Intel’s Hadoop
distribution and using AWS). If you review carefully, you will realize that they do have commonalities in implementing
encryption. Figure 8-18 summarizes the generic steps.
[Figure: (1) the client requests encrypted data from the NameNode, using client-side certificates and symmetric keys for data access; (2) the NameNode authenticates the request using its own KeyStores to compare the symmetric key; (3) if the keys are authenticated, the NameNode provides a list of nodes holding the data; (4) the client requests data blocks of the encrypted data from a DataNode; (5) the DataNode uses the key to decrypt the data block and, if successful, passes it back; (6) further DataNode communication decrypts and retrieves subsequent data blocks.]
Figure 8-18. A DataNode uses the symmetric key (from the client) to decrypt a data block and, if successful, passes it back; respective DataNodes retrieve and pass subsequent data blocks
Summary
Encryption at rest with Hadoop is still a work in progress, especially for the open source world. Perhaps when Hadoop
is used more extensively in the corporate world, our options will improve. For now, you must turn to paid third-party
solutions. The downside to these third-party solutions is that even though they claim to work with specific
distributions, their claims are difficult to verify. Also, it is not clear how much custom code they add to your Hadoop
install and what kind of performance you actually get for encryption/decryption. Last, these solutions are not
developed or tested by trained cryptographers or cryptanalysts. So, there is no reliability or guarantee that they are
(and will be) “unbreakable.”
Intel entered the Hadoop and encryption-at-rest arena with a lot of publicity and hype, but quickly backed off and
invested in Cloudera instead. Now the future of Project Rhino and possible integration of that code with Cloudera's
distribution doesn’t seem very clear. There are open source applications in bits and pieces, but a robust, integrated
solution that can satisfy the practical encryption needs of a serious Hadoop practitioner doesn’t exist yet.
For now, let’s hope that this Hadoop area generates enough interest among users to drive more options in the
future for implementing encryption using open source solutions.
Whatever the future holds, for the present, this is the last chapter. I sincerely hope this book has facilitated your
understanding of Hadoop security options and helps you make your environment secure!
PART V
Appendices
APPENDIX A
Pageant Use and Implementation
Pageant is an SSH authentication agent that can be used with PuTTY or WinSCP for holding your decrypted keys in
memory, so that you don’t need to enter your passphrase to decrypt your key every time you are authenticating to a
server using a key pair (Chapter 4 discusses key-based authentication in detail). If you are using multiple key pairs
to authenticate to multiple servers, Pageant is even more useful. You can use Pageant to hold all your decrypted keys
in memory, meaning you need to enter the respective passphrases only once when you start your Windows session.
When you log off your Windows session, Pageant exits without saving the decrypted keys on disk, which is the reason
you need to enter your passphrase again when you start your Windows session.
Because Pageant is part of PuTTY installation package, you can download it from the same URL
(http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). When you run the executable file
Pageant.exe to start Pageant, an icon that looks like a computer wearing a hat will appear in your system tray.
Right-click the icon to invoke the Pageant menu, and then select the menu option you need: New Session, Saved
Sessions, View Keys, Add Key, About, or Exit. If you select View Keys before adding keys, however, you will just see an
empty list box.
Using Pageant
To use Pageant, you need first to generate a key pair and copy the public key to the server to which you need to
connect. For example, I generated a key pair and saved the keys as keytest.ppk (private key) and keytest.pub
(public key). I then encrypted the private key using a passphrase. Because I wanted to connect to the host
pract_hdp_sec, I pasted my public key in the authorized_keys file in .ssh directory (as discussed in Chapter 4).
Next, I will store the decrypted private key in Pageant. Figure A-1 illustrates selecting and adding the key.
Figure A-1. Adding a key to Pageant
When you select a key (here, keytest.ppk), you are prompted for the passphrase (Figure A-2).
Figure A-2. Using Pageant to store passphrase for a key
After you enter the right passphrase, Pageant decrypts your private key and holds it in memory until you log off
your Windows session. You can see your key listed within Pageant, as shown in Figure A-3.
Figure A-3. Listing a stored key within Pageant
Now, you just need to specify your private key as means of authorization within PuTTY (Figure A-4).
Figure A-4. Specifying key-based authentication within PuTTY
Next time you want to connect to the server pract_hdp_sec, just open a PuTTY session, and it will prompt you for
login name. Once you enter the login name, PuTTY directly connects you to the server, as you can see in Figure A-5.
Figure A-5. Key-based authentication performed using decrypted key from Pageant
PuTTY recognizes that Pageant is running, retrieves the decrypted key automatically, and uses it to authenticate.
You can open as many PuTTY sessions for the same server as you need without typing your passphrase again.
In addition, Pageant can load multiple private keys automatically when it starts up. For example, suppose you
need to connect to ten servers on a daily basis. Manually adding the keys every day to Pageant is difficult as well as
error-prone. To automatically load multiple keys, use a Pageant command line similar to the following; the directory
path, of course, depends on where your Pageant.exe or your private key file (.ppk file) is located:
C:\Users\Administrator\Desktop>pageant.exe c:\bhushan\keytest.ppk c:\bhushan\bhushan.ppk
You can add multiple keys separated by space. If the keys are encrypted, Pageant will prompt for passphrases at
startup. If Pageant is already running and you execute this command, it will load keys into the existing Pageant.
You can also create a shortcut and specify the command line there, as shown in Figure A-6.
Figure A-6. Specifying a starting (default) directory for multiple keys
If you have just one private key, specify its full path within the Target field:
C:\Users\Administrator\Desktop>pageant.exe c:\bhushan\keytest.ppk
If you have multiple keys and the path is long, instead of specifying the path for each key, you can just specify a
starting directory. For example, to specify a starting point for my previous multi-key example, in the Target field enter
C:\Users\Administrator\Desktop>pageant.exe keytest.ppk and in the Start in field enter C:\Bhushan.
After Pageant initializes and loads the keys specified on its command line, you can direct Pageant to start another
program. This program (e.g., WinSCP or PuTTY) can then use the keys that Pageant loaded. The syntax is as follows:
C:\Users\Administrator\Desktop>pageant.exe c:\bhushan\keytest.ppk -c C:\PuTTY\putty.exe
Security Considerations
Holding your decrypted private keys in Pageant is more secure than storing key files on your local disk drive, but still
has some known security issues.
For example, Windows doesn't protect "swapped" data (memory data written to a system swap file) in any
way. So, if you use Pageant for a long time, the decrypted key data is likely to be swapped out and written to disk.
A malicious attacker who gains access to your hard disk could also gain access to your keys. This is, of course, much
more secure than storing an unencrypted key file on your local disk drive, but it still has vulnerabilities.
Windows has safeguards only to prevent executable code from writing into another executable program's memory
space; it still provides read access to it. In other words, programs can access each other's memory space, which is
intended as a way to assist in debugging. Unfortunately, malicious programs can exploit this feature and access
Pageant's memory to extract the decrypted keys and use them for unlawful purposes.
These risks can be mitigated, however, by making sure that your network infrastructure is secure and your
firewalls are in place.
APPENDIX B
PuTTY and SSH Implementation for
Linux-Based Clients
In the section “Key-Based Authentication Using PuTTY” in Chapter 4, you reviewed how PuTTY can effectively be
used for key-based authentication for a Windows-based client. What about key-based authentication for Linux-based
clients? The answer is PuTTY again.
You can download the Linux-based version of PuTTY from various sources. I used rpm (Red Hat Package
Manager, a package management system used for software distribution in the Linux domain) for the latest PuTTY
version (0.63) for CentOS 6.2; the file is putty-0.63-1.el6.rf.x86_64.rpm. You can download the rpm from various
sources; you just need to search for your operating system. After you download the file, install the rpm:
rpm -Uvh putty-0.63-1.el6.rf.x86_64.rpm
To generate a pair of private and public keys in the Linux version of PuTTY, you use a command line utility
called PuTTYgen, which is installed automatically when you install PuTTY via rpm. To generate the key pair, use the
following command:
puttygen -t rsa -C "my key pair" -o bcl.ppk
PuTTYgen then prompts you to enter a passphrase. Make a note of it, because you will need to specify the same
passphrase every time you use the key pair to connect to a host.
You can save the key in your home directory (easy to remember the location) and then export the public key to
the authorized_keys file using the following command:
puttygen -L bcl.ppk >> $HOME/.ssh/authorized_keys
Next, copy the authorized_keys file to hosts you need to connect to (using PuTTY). Note that if your host already
has an authorized_keys file in the $HOME/.ssh directory, then copy your newly created file using a different name and
append its contents to the existing authorized_keys file.
Next, invoke PuTTY at the command prompt by typing putty. The interface looks identical to its Windows-based
counterpart (Figure B-1).
Figure B-1. Linux PuTTY with key-based authentication
For connecting to a server, click the option SSH to open the drop-down and then click the option Auth
(authorization) under that. On the right side of the PuTTY interface, click Browse and select the private key file you
saved earlier (/root/bcl.ppk in this example). Click Open to open a new session.
That’s it! You are now ready to use PuTTY with key-based authentication! Figure B-2 shows the login prompt and
the prompt for a passphrase.

Figure B-2. Using Linux PuTTY with passphrase
Using SSH for Remote Access
You can also use SSH to connect remotely to a host. If you want to use a key pair for authentication with SSH, you first
need to use a utility called ssh-keygen to generate the keys. By default, the keys are saved in the $HOME/.ssh directory
as files id_rsa (private key) and id_rsa.pub (public key). Figure B-3 shows a key pair generated in the default location
without a passphrase (you can specify a passphrase for additional security).

Figure B-3. Using ssh-keygen to generate a key pair for remote access
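The run shown in Figure B-3 can also be scripted. A minimal non-interactive sketch follows; the empty passphrase (-N "") and the default file location are assumptions, not requirements:

ssh-keygen -t rsa -f $HOME/.ssh/id_rsa -N ""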
The public key can be copied to appropriate hosts and appended to the existing authorized_keys file in
$HOME/.ssh directory. To use the private key file to connect to a host, use the syntax:
ssh -i ~/.ssh/id_rsa root@Master
Here, root is the user and Master is the server to which you are trying to connect.
If you have multiple hosts and you want to organize the process of connecting to them, you can create host entries in a file called config in the directory $HOME/.ssh. The entries are created using the following format:
Host Master
User root
HostName Master
IdentityFile ~/.ssh/id_rsa
Then, you can simply connect as:
ssh Master
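If you manage several servers, you can add one stanza per host. A second, hypothetical entry might look like this:

Host Backup
User root
HostName backup.example.com
IdentityFile ~/.ssh/id_rsa

You could then connect with ssh Backup.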
APPENDIX C
Setting Up a KeyStore and TrustStore for HTTP Encryption
A KeyStore is a database or repository of keys and certificates that are used for a variety of purposes, including authentication, encryption, and data integrity. In general, a KeyStore contains information of two types: key entries and trusted certificates.
I have already discussed how to configure your Hadoop cluster with network encryption in Chapter 4’s “Encrypting HTTP Communication” section. As a part of that setup, you need to create HTTPS certificates and KeyStores.
Create HTTPS Certificates and KeyStore/TrustStore Files
To create HTTPS certificates and KeyStores, you need to perform the following steps:
1. For each host, create a directory for storing the KeyStore and TrustStore at SKEYLOC (you can substitute a directory name of your liking).
2. For each host, create a key pair and a separate KeyStore. Assuming that your operating system command prompt is $, that you have set the SKEYLOC directory parameter, and that you have a two-node cluster with hosts pract_hdp_sec and pract_hdp_sec2, the necessary code would look like the following:
$ cd $SKEYLOC
$ keytool -genkey -alias pract_hdp_sec -keyalg RSA -keysize 1024 -dname "CN=pract_hdp_sec,OU=IT,O=Ipsos,L=Chicago,ST=IL,C=us" -keypass 12345678 -keystore phsKeyStore1 -storepass 87654321
$ keytool -genkey -alias pract_hdp_sec2 -keyalg RSA -keysize 1024 -dname "CN=pract_hdp_sec2,OU=IT,O=Ipsos,L=Chicago,ST=IL,C=us" -keypass 56781234 -keystore phsKeyStore2 -storepass 43218765
This code generates two key pairs (a public key and an associated private key for each host) and single-element certificate chains, stored as entry pract_hdp_sec in KeyStore phsKeyStore1 and entry pract_hdp_sec2 in KeyStore phsKeyStore2, respectively. Notice the use of the RSA algorithm for public key encryption and the key length of 1024 bits.
3. For each host, export the certificate’s public key to a separate certificate file:
$ cd $SKEYLOC
$ keytool -export -alias pract_hdp_sec -keystore phsKeyStore1 -rfc -file pract_hdp_sec_cert -storepass 87654321
$ keytool -export -alias pract_hdp_sec2 -keystore phsKeyStore2 -rfc -file pract_hdp_sec2_cert -storepass 43218765
4. For all the hosts, import the certificates into a common TrustStore file:
$ cd $SKEYLOC
$ keytool -import -noprompt -alias pract_hdp_sec -file pract_hdp_sec_cert -keystore phsTrustStore1 -storepass 4324324
$ keytool -import -noprompt -alias pract_hdp_sec2 -file pract_hdp_sec2_cert -keystore phsTrustStore1 -storepass 4324324
Note that the TrustStore file is created if it doesn’t already exist.
5. Copy the KeyStore files and the common TrustStore file to the corresponding nodes:
$ scp phsKeyStore1 phsTrustStore1 root@pract_hdp_sec:/etc/hadoop/conf/
$ scp phsKeyStore2 phsTrustStore1 root@pract_hdp_sec2:/etc/hadoop/conf/
6. Validate the common TrustStore file:
$ keytool -list -v -keystore phsTrustStore1 -storepass 4324324
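For the two-node example, the listing should report two trusted certificate entries. Trimmed output would look roughly like the following (creation dates and fingerprints elided):

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 2 entries

Alias name: pract_hdp_sec
Entry type: trustedCertEntry
...
Alias name: pract_hdp_sec2
Entry type: trustedCertEntry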
Adjust Permissions for KeyStore/TrustStore Files
The KeyStore files need to have read permissions for owner and group only, and the group should be set to hadoop. The TrustStore files should have read permissions for everyone (owner, group, and others). The following commands set this up:
$ ssh root@pract_hdp_sec "cd /etc/hadoop/conf; chgrp hadoop phsKeyStore1; chmod 0440 phsKeyStore1; chmod 0444 phsTrustStore1"
$ ssh root@pract_hdp_sec2 "cd /etc/hadoop/conf; chgrp hadoop phsKeyStore2; chmod 0440 phsKeyStore2; chmod 0444 phsTrustStore1"
If need be, you can also export the public key certificates for installation in your browser.
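As a hedged example, the following exports a browser-importable binary (DER) certificate; the output file name is arbitrary:

$ keytool -export -alias pract_hdp_sec -keystore phsKeyStore1 -file pract_hdp_sec.der -storepass 87654321

This completes the setup of a KeyStore and TrustStore for HTTP encryption.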
APPENDIX D
Hadoop Metrics and Their Relevance to Security
In Chapter 7’s “Hadoop Metrics” section, you reviewed what Hadoop metrics are, how you can apply filters to metrics, and how you can direct them to a file or to monitoring software such as Ganglia. As you will soon learn, you can use these metrics for security as well.
As you will remember, you can use Hadoop metrics to set alerts that capture sudden changes in system resources. In addition, you can set up your Hadoop cluster to monitor NameNode resources and generate alerts when any specified resource deviates from desired parameters. For example, I will show you how to generate alerts when the deviation for any of the following resources exceeds the monthly average by 50% or more:
FilesCreated
FilesDeleted
Transactions_avg_time
GcCount
GcTimeMillis
LogFatal
MemHeapUsedM
ThreadsWaiting
First, I direct the output of the NameNode metrics to a file. To do so, I add the following lines to the file hadoop-metrics2.properties in the directory $HADOOP_INSTALL/hadoop/conf:
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
namenode.sink.file.filename=namenode-metrics.log
Next, I set filters to include only the necessary metrics:
*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
*.record.filter.class=${*.source.filter.class}
*.metric.filter.class=${*.source.filter.class}
namenode.sink.file.metric.filter.include=FilesCreated
namenode.sink.file.metric.filter.include=FilesDeleted
namenode.sink.file.metric.filter.include=Transactions_avg_time
namenode.sink.file.metric.filter.include=GcCount
namenode.sink.file.metric.filter.include=GcTimeMillis
namenode.sink.file.metric.filter.include=LogFatal
namenode.sink.file.metric.filter.include=MemHeapUsedM
namenode.sink.file.metric.filter.include=ThreadsWaiting
My filtered list of metrics is now written to the output file namenode-metrics.log.
Next, I develop a script to load this file daily into HDFS and add it to a Hive table as a new partition, as sketched below. I then recompute the 30-day average, taking the new values into account, and compare the average values with the newly loaded daily values.
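A minimal sketch of that daily comparison, assuming a Hive table nn_metrics (metric_name STRING, metric_value DOUBLE) partitioned by log_date; the table name, HDFS directory, and date literal are all hypothetical:

# Load today's metrics file into HDFS and register it as a new partition
hdfs dfs -mkdir -p /metrics/namenode/2014-08-01
hdfs dfs -put namenode-metrics.log /metrics/namenode/2014-08-01/
hive -e "ALTER TABLE nn_metrics ADD PARTITION (log_date='2014-08-01') LOCATION '/metrics/namenode/2014-08-01'"

# Flag metrics that deviate from the trailing 30-day average by 50% or more
hive -e "
SELECT d.metric_name, d.metric_value, a.avg_value
FROM (SELECT metric_name, metric_value FROM nn_metrics WHERE log_date = '2014-08-01') d
JOIN (SELECT metric_name, AVG(metric_value) AS avg_value
      FROM nn_metrics WHERE log_date >= date_sub('2014-08-01', 30)
      GROUP BY metric_name) a
ON d.metric_name = a.metric_name
WHERE d.metric_value >= 1.5 * a.avg_value"

If the query returns any rows, the script can e-mail them to the Hadoop system administrator.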
If the deviation is more than 50% for any of these values, I can send a message to my Hadoop system
administrator with the name of the node and the metric that deviated. The system administrator can then check
appropriate logs to determine whether there are any security breaches. For example, if the ThreadsWaiting metric is deviating by more than 50%, the system administrator will need to check the audit logs to see who was accessing the cluster and who was executing jobs at that time, and then check the relevant jobs as indicated by the audit logs. A suspicious job may require a check of the JobTracker and the appropriate TaskTracker logs.
Alternatively, you can direct these metrics to Ganglia and then use Nagios to generate alerts if any of the metric values deviate.
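As a sketch, assuming the Ganglia sink class that ships with Hadoop’s metrics2 package and a hypothetical gmond address, the additional hadoop-metrics2.properties entries would look like this:

namenode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
namenode.sink.ganglia.servers=239.2.11.71:8649
namenode.sink.ganglia.period=10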
Tables D-1 through D-4 list some commonly used Hadoop metrics. The JVM and RPC context metrics are listed
first, because they are generated by all Hadoop daemons.
Table D-1. JVM and RPC Context Metrics
Metric Group | Metric Name | Description
JVM | GcCount | Number of garbage collections of the enterprise console JVM
 | GcTimeMillis | Calculates the total time all garbage collections have taken, in milliseconds
 | LogError | Number of log lines with Log4j level ERROR
 | LogFatal | Number of log lines with Log4j level FATAL
 | LogWarn | Number of log lines with Log4j level WARN
 | LogInfo | Number of log lines with Log4j level INFO
 | MemHeapCommittedM | Calculates the heap memory committed by the enterprise console JVM
 | MemHeapUsedM | Calculates the heap memory used by the enterprise console JVM
 | ThreadsBlocked | Number of threads in a BLOCKED state, which means they are waiting for a monitor lock
 | ThreadsWaiting | Number of threads in a WAITING state, which means they are waiting indefinitely for another thread to perform an action
 | ThreadsRunnable | Number of threads in a RUNNABLE state that are executing in the JVM
 | ThreadsTerminated | Number of threads in a TERMINATED state, which means they have completed execution
 | ThreadsNew | Number of threads in a NEW state, which means they have not yet started
RPC | ReceivedBytes | Number of RPC received bytes
 | SentBytes | Number of RPC sent bytes
 | RpcProcessingTimeAvgTime | Average time for processing RPC requests
 | RpcProcessingTimeNumOps | Number of processed RPC requests
 | RpcQueueTimeAvgTime | Average time spent by an RPC request in the queue
 | RpcQueueTimeNumOps | Number of RPC requests that were queued
 | RpcAuthorizationSuccesses | Number of successful RPC authorization calls
 | RpcAuthorizationFailures | Number of failed RPC authorization calls
 | RpcAuthenticationSuccesses | Number of successful RPC authentication calls
 | RpcAuthenticationFailures | Number of failed RPC authentication calls
Table D-2. NameNode and DataNode Metrics
Metric Group | Metric Name | Description
Hadoop.HDFS.NameNode | AddBlockOps | Number of add block operations for a cluster
 | CapacityRemaining | Total capacity remaining in HDFS
 | CapacityTotal | Total capacity in HDFS and other distributed file systems
 | CapacityUsed | Total capacity used in HDFS
 | CreateFileOps | Number of create file operations for a cluster
 | DeadNodes | Number of dead nodes that exist in a cluster
 | DecomNodes | Number of decommissioned nodes that exist in a cluster
 | DeleteFileOps | Number of “delete” file operations occurring in HDFS
 | FSState | State of the NameNode, which can be in safe mode or operational
 | FileInfoOps | Number of file access operations occurring in the cluster
 | FilesAppended | Number of files appended in a cluster
 | FilesCreated | Number of files created in a cluster
 | FilesDeleted | Number of files deleted in a cluster
 | FilesInGetListingOps | Number of get listing operations occurring in a cluster
 | FilesRenamed | Number of files renamed in a cluster
 | LiveNodes | Number of live nodes in a cluster
 | NonDfsUsedSpace | Calculates the non-HDFS space used in the cluster
 | PercentRemaining | Percentage of remaining HDFS capacity
 | PercentUsed | Percentage of used HDFS capacity
 | Safemode | Calculates the safe mode state: 1 indicates safe mode is on; 0 indicates it is off
 | SafemodeTime | Displays the time spent by NameNode in safe mode
 | Syncs_avg_time | Average time for the sync operation
 | Syncs_num_ops | Number of sync operations
 | TotalBlocks | Total number of blocks in a cluster
 | TotalFiles | Total number of files in a cluster
 | Transactions_avg_time | Average time for a transaction
 | Transactions_num_ops | Number of transaction operations
 | UpgradeFinalized | Indicates if the upgrade is finalized as true or false
 | addBlock_avg_time | Average time to create a new block in a cluster
 | addBlock_num_ops | Number of operations to add data blocks in a cluster
 | blockReceived_avg_time | Average time to receive a block operation
 | blockReceived_num_ops | Number of block received operations
 | blockReport_num_ops | Number of block report operations
 | blockReport_avg_time | Average time for a block report operation
 | TimeSinceLastCheckpoint | Calculates the amount of time since the last checkpoint
Hadoop.HDFS.DataNode | BlocksRead | Number of times that a block is read from the hard disk
 | BlocksRemoved | Number of removed or invalidated blocks on the DataNode
 | BlocksReplicated | Number of blocks transferred or replicated from one DataNode to another
 | BlocksVerified | Number of block verifications, including successful and failed verifications
 | BlocksWritten | Number of blocks written to disk
 | BytesRead | Number of bytes read when reading and copying a block
 | BytesWritten | Number of bytes written to disk in response to a write request
 | HeartbeatsAvgTime | Average time to send a heartbeat from DataNode
 | HeartbeatsNumOps | Number of heartbeat operations occurring in a cluster
Table D-3. MapReduce Metrics Generated by JobTracker
Metric Group | Metric Name | Description
Hadoop.Mapreduce.Jobtracker | blacklisted_maps | Number of blacklisted map slots in each TaskTracker
 | blacklisted_reduces | Number of blacklisted reduce slots in each TaskTracker
 | Heartbeats | Total number of JobTracker heartbeats
 | HeartbeatAvgTime | Average time for a heartbeat
 | callQueueLen | Calculates the RPC call queue length
 | jobs_completed | Number of completed jobs
 | jobs_failed | Number of failed jobs
 | jobs_killed | Number of killed jobs
 | jobs_running | Number of running jobs
 | jobs_submitted | Number of submitted jobs
 | maps_completed | Number of completed maps
 | maps_failed | Number of failed maps
 | maps_killed | Number of killed maps
 | maps_launched | Number of launched maps
 | memNonHeapCommittedM | Non-heap committed memory (MB)
 | memNonHeapUsedM | Non-heap used memory (MB)
 | occupied_map_slots | Number of occupied map slots
 | map_slots | Number of map slots
 | occupied_reduce_slots | Number of occupied reduce slots
 | reduce_slots | Number of reduce slots
 | reduces_completed | Number of completed reducers
 | reduces_failed | Number of failed reducers
 | reduces_killed | Number of killed reducers
 | reduces_launched | Number of launched reducers
 | reserved_map_slots | Number of reserved map slots
 | reserved_reduce_slots | Number of reserved reduce slots
 | running_0 | Number of running jobs
 | running_60 | Number of jobs running for more than one hour
 | running_300 | Number of jobs running for more than five hours
 | running_1440 | Number of jobs running for more than 24 hours
 | running_maps | Number of running maps
 | running_reduces | Number of running reducers
 | Trackers | Number of TaskTrackers
 | trackers_blacklisted | Number of blacklisted TaskTrackers
 | trackers_decommissioned | Number of decommissioned TaskTrackers
 | trackers_graylisted | Number of graylisted TaskTrackers
 | waiting_maps | Number of waiting maps
 | waiting_reduces | Number of waiting reduces
Table D-4. HBase Metrics
Metric Group | Metric Name | Description
hbase.master | MemHeapUsedM | Heap memory used in MB
 | MemHeapCommittedM | Heap memory committed in MB
 | averageLoad | Average number of regions served by each region server
 | numDeadRegionServers | Number of dead region servers
 | numRegionServers | Number of online region servers
 | ritCount | Number of regions in transition
 | ritCountOverThreshold | Number of regions in transition that exceed the threshold as defined by the property rit.metrics.threshold.time
 | clusterRequests | Total number of requests from all region servers to a cluster
 | HlogSplitTime_mean | Average time to split write-ahead log files after a restart
 | HlogSplitTime_min | Minimum time to split write-ahead log files after a restart
 | HlogSplitTime_max | Maximum time to split write-ahead log files after a restart
 | HlogSplitTime_num_ops | Number of write-ahead log file split operations
 | HlogSplitSize_mean | Average size of the write-ahead (Hlog) files that were split
 | HlogSplitSize_min | Minimum size of the write-ahead (Hlog) files that were split
 | HlogSplitSize_max | Maximum size of the write-ahead (Hlog) files that were split
 | HlogSplitSize_num_ops | Size of the write-ahead log files that were split
hbase.regionserver | appendCount | Number of WAL appends
 | blockCacheCount | Number of StoreFiles cached in the block cache
 | blockCacheEvictionCount | Total number of blocks that have been evicted from the block cache
 | blockCacheFreeSize | Number of bytes that are free in the block cache
 | blockCacheExpressHitPercent | Calculates the block cache hit percent for requests where caching was turned on
 | blockCacheHitCount | Total number of block cache hits for requests, regardless of caching setting
 | blockCountHitPercent | Block cache hit percent for all requests, regardless of the caching setting
 | blockCacheMissCount | Total number of block cache misses for requests, regardless of caching setting
 | blockCacheSize | Number of bytes used by cached blocks
 | compactionQueueLength | Number of HRegions on the CompactionQueue; these regions call compact on all stores, and then find out if a compaction is needed along with the type of compaction
 | MemMaxM | Calculates the maximum heap memory available, in MB
 | MemHeapUsedM | Calculates the heap memory used in MB
 | MemHeapCommittedM | Calculates the heap memory committed in MB
 | GcCount | Number of total garbage collections
 | updatesBlockedTime | Number of memstore updates that have been blocked so that the memstore can be flushed
 | memstoreSize | Calculates the size of all memstores in all regions in MB
 | readRequestCount | Number of region server read requests
 | regionCount | Number of online regions served by a region server
 | slowAppendCount | Number of appends that took more than 1,000 ms to complete
 | slowGetCount | Number of gets that took more than 1,000 ms to complete
 | slowPutCount | Number of puts that took more than 1,000 ms to complete
 | slowIncrementCount | Number of increments that took more than 1,000 ms to complete
 | slowDeleteCount | Number of deletes that took more than 1,000 ms to complete
 | storeFileIndexSize | Calculates the size of all StoreFile indexes in MB; these are not necessarily in memory, because they are stored in the block cache as well and might have been evicted
 | storeFileCount | Number of StoreFiles in all stores and regions
 | storeCount | Number of stores in all regions
 | staticBloomSize | Calculates the total size of all Bloom filters, which are not necessarily loaded in memory
 | staticIndexSize | Calculates the total static index size for all region server entities
 | writeRequestCount | Number of write requests to a region server
Index
A
Access Control List (ACL), 41
Activity statistics
DataNode, 124
NameNode, 124
RPC-related processing, 125
sudden system resources change, 125
Advanced encryption standard (AES) algorithms, 148
Amazon Web Services
EMR cluster, 164
Envelope encryption, 160
Identity and Access Management creation, 162
key management infrastructure, 159
key pair creation, 162
management console, 159–160
S3 bucket creation, 160–161
security credentials, 163
Amazon Web Services (AWS), 154
B
Block ciphers, 152
Burrows–Abadi–Needham (BAN) logic, 9
C
check_long_running_procs.sh, 141
check_ssh_faillogin, 141
Commercial-grade encryption algorithms, 146
Cross-authentication, 60
Cryptography, 145
D
Data encryption standard (DES) algorithm, 147
Dfs metrics, 121–123
Digital signature, 152
Digital Signature Algorithm (DSA), 151
Distributed system, 12
authentication, 13
authorization, 14
encryption
SQL Server security layers, 15–16
symmetric keys/certificates, 14
TDE, 15
ERP, 12
monitoring, 119–120
SQL Server secures data, 13
E
Encryption, 145
algorithms
AES, 148
asymmetric, 147
DES, 147
DSA, 151
DSA vs. RSA, 151
RSA, 150
symmetric algorithm, 146
Amazon Web Services
EMR Cluster, 164
Envelope encryption, 160
Identity and Access Management
creation, 162
key management infrastructure, 159
key pair creation, 162
management console, 159–160
S3 bucket creation, 160–161
security credentials, 163
test encryption, 166
applications
digital signature and certificates, 152
hash functions, 151
key exchange, 152
data at rest, 153
definition, 145
Encryption (cont.)
Hadoop distribution, 154
KeyStore, 155
special classes, 158
step-by-step implementation, 155
TrustStore, 155
principles, 145
F
Fine-grained authorization, 75
access permissions, 78
Hadoop environment
system analysis, 76
ticket data details, 77
security model implementation
extending ticket data, 81
HDFS permission model, 82
ticket data storage, 79
users and groups, 80
G
Ganglia, 127
architecture
gmetad component, 129
gmond component, 128
gweb component, 129
RRDtool component, 129
configuration and use of, 129
dashboard, 134
HBase monitoring, 133
H
Hadoop architecture
Apache Hadoop YARN, 28
DataNodes, 20
HA NameNode, 24
HDFS (see Hadoop Distributed File System (HDFS))
MapReduce framework
and job processing, 27
aspect of, 26
input key-value, 27–28
JobTracker, 26–27
phases, 26
security issues, 29
task attempt, 26
Hadoop daemon, 121
Hadoop Distributed File System (HDFS), 31
add/remove DataNodes, 22
cluster rebalancing, 22
definition, 20
disk storage, 22
file storage and replication system, 21
NameNode, 20–21
Secondary NameNode, 23
security issues, 29
client/server model, 25
communication protocols and vulnerabilities, 26
data provenance, 31
data-transfer protocol, 25
enterprise security, 30
existing user credentials and policies, 30
rest, data encryption, 30–31
threats, 25
unencrypted data,transit, 30
Hadoop logs, 97
analytics, 116
audit logs, 107
correlation, 109
grep command, 112
HDFS audit log, 109
Hive logs, 110
investigators, 109
MapReduce audit log, 110–111
sed command, 112
to retrieve records, 113
using browser interface, 113
using job names, 111
daemon logs, 107
splunk, 116
time synchronization, 116
Hadoop metrics, 121
activity statistics
DataNode, 124
NameNode, 124
RPC-related processing, 125
sudden system resources change, 125
data filtering, 125
dfs, 121–123
Hadoop daemon, 121
jvm, 121–122
mapred, 121–122, 124
Metrics2, 122
rpc, 121–123
to output files, 126
Hadoop monitoring, 119
distributed system, 119–120
Ganglia (see Ganglia)
Nagios (see Nagios)
simple monitoring system, 120
Hadoop security, 37
data encryption
in transit, 45
rest, 46
HDFS (see HDFS)
issues, 38
Hadoop Stack, 31
common libraries/utilities, 31
components, 32
core modules, 32
HDFS, 31
MapReduce, 32
YARN, 31
Hash functions, 151
HBase monitoring, with Ganglia, 133
HDFS
authorization
ACL, 41
claims data, 39
file permissions, 40
groups, 41
portable operating system interface, 39
process, 39
daemons, 41
Ganglia monitoring system, 44
Kerberos, 38
monitoring, 43
Nagios, 45
security issue
business cases, 42
HIPAA, 42
Log4j module, 42–43
Health Information Portability and
Accountability Act (HIPAA), 42
HTTP protocol
certificates, 72
core-site.xml properties, 73
data transfer, 74
shuffle traffic, 72
SSL properties, 73
usage, 71
I
Identity and Access Management (IAM) console, 162
J
Java Cryptography Extension (JCE), 62
Jvm metrics, 121–122
K
Kerberos, 8
architecture, 58
database
creation, 62–63
definition, 60
definition, 58
Hadoop, implementation, 65
core-site.xml, 66
DataNode log file, 71
hdfs-site.xml configuration file, 66–68
mapred principals, 68
map service principals, 65
NameNode, 70
TaskController class, 69
YARN containerexecutor.cfg, 69–70
YARN principals, 69
installation and configuration, 60
key facts, 58
Keytab files, 63
principal, 60
realms, 60
service principals, 63
TGT, 59
tickets, 60
usage, 58
Key management infrastructure (KMI), 159
L
Lightweight Directory Access
Protocol (LDAP), 30
Local monitoring data, 121
Log4j API
appenders, 102
additivity, 102
HDFS audit, 103
Filters, 105
flexibility, 98
framework, 99
layout, 103
DateLayout, 105
HTMLLayout, 105
PatternLayout, 104–105
Simple Layout, 104
TTCCLayout, 104
XMLLayout, 105
loggers, 99
HDFS audit, 101
inheritance, 100
logging levels, 99
reliability, 97
speed, 98
time-consuming process, 98
M
Malicious flaws
logic trigger, 12
prevent infections, 12
rabbit, 12
trap door, 12
Trojan horse, 12
virus, 11
worm, 11
Mapred metrics, 121–122, 124
Metrics. See Hadoop metrics
Metrics2 system, 122
Monitoring. See Ganglia; Hadoop monitoring; Nagios
Mutual authentication, 73
N
Nagios, 127, 134
architecture, 135
commands and macros, 138
integration with Ganglia, 136
plug-ins, 136, 140
user community, 141
web interface, 140
Needham–Schroeder Symmetric Key Protocol, 7
Non-malicious flaws
buffer, 10
Incomplete mediation, 10
Time-of-Check to Time-of-Use errors, 11
O
Open source authentication, 51
client-server communications
HTTP protocol (see HTTP protocol)
Inter-process communication, 72
remote procedure call, 71
TaskTracker, 71
TCP/IP protocol, 71
Kerberos (see Kerberos)
passphrases, 56
PuTTY (see PuTTY)
security puzzle, 51–52
P, Q
Program
definition, 9
failure, 9
fault, 9
malicious flaws (see Malicious flaws)
non-malicious flaws (see Non-malicious flaws)
Public key cryptography, 147
PuTTY
advantage, 52
host key, 53
key-based authentication
authorized_keys file, 56
definition, 53
Generate button, 55
private key, 54
RSA key, 54
SSH, 53
spoofing, 53
R
Remote procedure call (RPC), 24, 71
Rijndael, AES, 148
Rivest-Shamir-Adleman (RSA) algorithm, 150
Role-based authorization, 85
configuration changes, 88
design roles, 90
design rules, 89
design tables, 88
users and groups, 90
HDFS file permissions, 88
Hive architecture, 85
Kerberos authentication, 87
permission details, 87
Sentry architecture, 86
rules and roles, 86
users and groups, 86
ticketing system, 87
Rpc metrics, 121–123
S
Sarbanes-Oxley Act (SOX), 42
Secret keys, 146
Security engineering
BAN logic, 9
definition, 3
framework
implementation, 4–5
motivation, 4–5
relevance, 4–5
reliability, 4–5
strategy, 4–5
Kerberos, 8
Needham–Schroeder Symmetric Key Protocol, 7
protocols, 7
psychological aspects of
client certificates/custom-built applications, 6
password scramblers, 5
pretexting, 5
strong password protocols, 7
trusted computing, 6
two-channel authentication, 7
two-phase authentication, 6
requirement, 3
Security monitoring system. See Ganglia; Nagios
show_users, 141
Simple Authentication and Security Layer (SASL), 72
Software development life cycle (SDLC), 9
SQL injection, 10
T, U, V
Thread-Time-Category-Context Layout, 104
Ticket Granting Service (TGS), 38
Ticket Granting Ticket (TGT), 38, 59
Transparent Database Encryption (TDE), 15
W, X, Y, Z
Wrapper, 74
Practical Hadoop Security
Bhushan Lakhe
Copyright © 2014 by Bhushan Lakhe
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval,
electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material
supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the
purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the
Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from
Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are
liable to prosecution under the respective Copyright Law.
ISBN-13 (pbk): 978-1-4302-6544-3
ISBN-13 (electronic): 978-1-4302-6545-0
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every
occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion
and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified
as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither
the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may
be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Managing Director: Welmoed Spahr
Acquisitions Editor: Robert Hutchinson
Developmental Editor: Linda Laflamme
Technical Reviewer: Robert L. Geiger
Editorial Board: Steve Anglin, Mark Beckner, Gary Cornell, Louise Corrigan, James DeWolf, Jonathan Gennick,
Robert Hutchinson, Michelle Lowman, James Markham, Matthew Moodie, Jeff Olson, Jeffrey Pepper,
Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing, Matt Wade, Steve Weiss
Coordinating Editor: Rita Fernando
Copy Editor: James Fraleigh
Compositor: SPi Global
Indexer: SPi Global
Cover Designer: Anna Ishchenko
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor,
New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit
www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science +
Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail rights@apress.com, or visit www.apress.com.
Any source code or other supplementary material referenced by the author in this text is available to readers at
www.apress.com. For detailed information about how to locate your book’s source code, go to
www.apress.com/source-code/.
To my beloved father . . . you will always be a part of me!
Contents
About the Author xiii
About the Technical Reviewer xv
Part I: Introducing Hadoop and Its Security 1
Chapter 1: Understanding Security Concepts 3
Introducing Security Engineering 3
Security Engineering Framework 4
Psychological Aspects of Security Engineering 5
Introduction to Security Protocols 7
Securing a Program 9
Non-Malicious Flaws 10
Malicious Flaws 11
Securing a Distributed System 12
Authentication 13
Authorization 14
Encryption 14
Summary 17
Chapter 2: Introducing Hadoop 19
Hadoop Architecture 19
HDFS 20
Inherent Security Issues with HDFS Architecture 25
Hadoop’s Job Framework using MapReduce 26
Inherent Security Issues with Hadoop’s Job Framework 29
Hadoop’s Operational Security Woes 29
The Hadoop Stack 31
Main Hadoop Components 32
Summary 35
Chapter 3: Introducing Hadoop Security 37
Starting with Hadoop Security 37
Introducing Authentication and Authorization for HDFS 38
Authorization 38
Real-World Example for Designing Hadoop Authorization 39
Fine-Grained Authorization for Hadoop 41
Securely Administering HDFS 41
Using Hadoop Logging for Security 42
Monitoring for Security 43
Tools of the Trade 43
Encryption: Relevance and Implementation for Hadoop 45
Encryption for Data in Transit 45
Encryption for Data at Rest 46
Summary 47
Part II: Authenticating and Authorizing Within Your Hadoop Cluster 49
Chapter 4: Open Source Authentication in Hadoop 51
Pieces of the Security Puzzle 51
Establishing Secure Client Access 52
Countering Spoofing with PuTTY’s Host Keys 53
Key-Based Authentication Using PuTTY 53
Using Passphrases 56
Building Secure User Authentication 58
Kerberos Overview 58
Installing and Configuring Kerberos 60
Preparing for Kerberos Implementation 62
Implementing Kerberos for Hadoop 65
Securing Client-Server Communications 71
Safe Inter-process Communication 72
Encrypting HTTP Communication 72
Securing Data Communication 74
Summary 74
Chapter 5: Implementing Granular Authorization 75
Designing User Authorization 75
Call the Cops: A Real-World Security Example 76
Determine Access Groups and their Access Levels 78
Implement the Security Model 79
Access Control Lists for HDFS 82
Role-Based Authorization with Apache Sentry 85
Hive Architecture and Authorization Issues 85
Sentry Architecture 86
Implementing Roles 87
Summary 93
Part III: Audit Logging and Security Monitoring 95
Chapter 6: Hadoop Logs: Relating and Interpretation 97
Using Log4j API 97
Loggers 99
Appenders 102
Layout 103
Filters 105
Reviewing Hadoop Audit Logs and Daemon Logs 106
Audit Logs 106
Hadoop Daemon Logs 107
Correlating and Interpreting Log Files 108
What to Correlate? 109
How to Correlate Using Job Name? 111
Important Considerations for Logging 115
Time Synchronization 116
Hadoop Analytics 116
Splunk 116
Summary 117
Chapter 7: Monitoring in Hadoop 119
Overview of a Monitoring System 119
Simple Monitoring System 120
Monitoring System for Hadoop 120
Hadoop Metrics 121
The jvm Context 122
The dfs Context 123
The rpc Context 123
The mapred Context 124
Metrics and Security 124
Metrics Filtering 125
Capturing Metrics Output to File 126
Security Monitoring with Ganglia and Nagios 127
Ganglia 127
Monitoring HBase Using Ganglia 133
Nagios 134
Nagios Integration with Ganglia 136
The Nagios Community 141
Summary 141
Part IV: Encryption for Hadoop 143
Chapter 8: Encryption in Hadoop 145
Introduction to Data Encryption 145
Popular Encryption Algorithms 146
Applications of Encryption 151
Hadoop Encryption Options Overview 153
Encryption Using Intel’s Hadoop Distro 154
Step-by-Step Implementation 155
Special Classes Used by Intel Distro 158
Using Amazon Web Services to Encrypt Your Data 159
Deciding on a Model for Data Encryption and Storage 159
Encrypting a Data File Using Selected Model 160
Summary 168
Part V: Appendices 169
Appendix A: Pageant Use and Implementation 171
Using Pageant 171
Security Considerations 176
Appendix B: PuTTY and SSH Implementation for Linux-Based Clients 177
Using SSH for Remote Access 179
Appendix C: Setting Up a KeyStore and TrustStore for HTTP Encryption 181
Create HTTPS Certificates and KeyStore/TrustStore Files 181
Adjust Permissions for KeyStore/TrustStore Files 182
Appendix D: Hadoop Metrics and Their Relevance to Security 183
Index 191
About the Author

Bhushan Lakhe is a database professional, technology evangelist, and avid blogger
residing in the windy city of Chicago. After graduating in 1988 from one of India’s
leading universities (Birla Institute of Technology & Science, Pilani), he started his
career with India’s biggest software house, Tata Consultancy Services. Soon sent to
the UK on a database assignment, he joined ICL, a British computer company, and
worked with prestigious British clients on database assignments. Moving to Chicago
in 1995, he worked as a consultant with such Fortune 50 companies as Leo Burnett,
Blue Cross and Blue Shield of Illinois, CNA Insurance, ABN AMRO Bank, Abbott
Laboratories, Motorola, JPMorgan Chase, and British Petroleum, often in a critical
and pioneering role.
After a seven-year stint executing successful Big Data (as well as Data
Warehouse) projects for IBM’s clients (and receiving the company’s prestigious
Gerstner Award in 2012), Mr. Lakhe spent two years helping Unisys Corporation’s
clients with Data Warehouse and Big Data implementations. Mr. Lakhe is currently
working as Senior Vice President of Information and Data Architecture at Ipsos,
the world’s third largest market research corporation, and is responsible for the company’s Global Data Architecture
and Big Data strategy. Mr. Lakhe is active in the Chicago Hadoop community and regularly answers queries on
various Hadoop user forums. You can find Mr. Lakhe on LinkedIn at https://www.linkedin.com/pub/bhushan-lakhe/0/455/475.
About the Technical Reviewer

Robert L. Geiger leads product strategy for Data Driven Business Solutions and
Hybrid-Transactional Processing at TransLattice, Inc. Previously, he was an
architect and team lead at Pivotal, working in the areas of Big Data system security,
as well as Hadoop and Big Data as a service. Mr. Geiger served as Vice President of
Engineering for Mu Dynamics (formerly Mu Security). Previously, he was Senior
Director of Engineering at Symantec where he led the engineering team building
Symantec’s award-winning SNS Intrusion Protection product, after the acquisition
of Recourse Technologies. Mr. Geiger spent 10 years at Motorola Labs working
on electromagnetic systems modeling using massively parallel supercomputers,
wireless data systems development, mobile security software, and e-commerce
solutions. He holds several patents in the areas of mobile data, wireless security,
and e-commerce. Mr Geiger has a Masters of Electrical Engineering degree from the University of Illinois, Urbana,
and a Bachelor of Science degree in Electrical Engineering from the State University of New York.
Acknowledgments
While writing this book, I received help (both directly and indirectly) from a lot of people, and I would like to thank
them all. Thanks to the Hadoop community and the user forums, from whom I have learned a great deal. There are
many selfless people in the Hadoop community, and I feel that’s the biggest strength of the “Hadoop revolution.” In
particular, thanks to my friend Satya Kondapalli for introducing Hadoop to me, and to my friends Naveed Asem and
Zafar Ahmed for keeping me motivated!
I would like to thank the Intel Technical team (Bala Subramanian, Sunjay Karan, Manjunatha Prabhu) in
Chicago for their help with Intel’s Hadoop distribution and encryption at rest, as well as Cloudera’s Justin Kestelyn for
providing information about Sentry. Last, I would like to thank my friend Milind Bhandarkar (of Pivotal) for his help
and support.
I am grateful to my editors, Linda Laflamme and Rita Fernando, at Apress for their help in getting this book
together. Linda and Rita have been there throughout to answer any questions that I have, to read and improve my
first (and second and third . . .) drafts, and to keep me on schedule! I am also very thankful to Robert Geiger for taking
time to review my book technically. Bob always had great suggestions for improving a topic, recommended additional
details, and, of course, resolved technical shortcomings!
Last, writing this book has been a lot of work, and I couldn’t have done it without the constant support from my
family. My wife, Swati, and my kids, Anish and Riya, have been very understanding. I’m looking forward to spending
lots more time with all of them!